Abstract

Low-distortion subspace embeddings are critical building blocks for developing improved random
sampling and random projection algorithms for common linear algebra problems. Here, we show
that, given a matrix $A \in \mathbb{R}^{n \times d}$, with $n \gg d$, and a $p \in [1, 2)$, with a constant probability, we can
construct a low-distortion embedding matrix $\Pi \in \mathbb{R}^{\mathcal{O}(\mathrm{poly}(d)) \times n}$ that embeds $\mathcal{A}_p$, the $\ell_p$ subspace
spanned by $A$'s columns, into $(\mathbb{R}^{\mathcal{O}(\mathrm{poly}(d))}, \|\cdot\|_p)$; the distortion of our embeddings is only
$\mathcal{O}(\mathrm{poly}(d))$, and we can compute $\Pi A$ in $\mathcal{O}(\mathrm{nnz}(A))$ time, i.e., input-sparsity time. Our result
generalizes the input-sparsity time $\ell_2$ subspace embedding proposed recently by Clarkson and
Woodruff; and for completeness, we present a simpler and improved analysis of their construction
for $\ell_2$. These input-sparsity time $\ell_p$ embeddings are optimal, up to constants, in terms of
their running time; and the improved running time propagates to applications such as $(1 \pm \epsilon)$-distortion
$\ell_p$ subspace embedding and relative-error $\ell_p$ regression. For $\ell_2$, we show that a
$(1 + \epsilon)$-approximate solution to the $\ell_2$ regression problem specified by the matrix $A$ and a vector
$b \in \mathbb{R}^n$ can be computed in $\mathcal{O}(\mathrm{nnz}(A) + d^5 \log(d/\epsilon)/\epsilon^2)$ time; and for $\ell_p$, via a subspace-preserving
sampling procedure, we show that a $(1 \pm \epsilon)$-distortion embedding of $\mathcal{A}_p$ into $\mathbb{R}^{\mathcal{O}(\mathrm{poly}(d))}$ can be
computed in $\mathcal{O}(\mathrm{nnz}(A) \cdot \log n)$ time, and we also show that a $(1 + \epsilon)$-approximate solution to the $\ell_p$
regression problem $\min_{x \in \mathbb{R}^d} \|Ax - b\|_p$ can be computed in $\mathcal{O}(\mathrm{nnz}(A) \cdot \log n + \mathrm{poly}(d) \log(1/\epsilon)/\epsilon^2)$
time. Moreover, we can also improve the embedding dimension, or equivalently the sample size,
to $\mathcal{O}(d^{3+p/2} \log(1/\epsilon)/\epsilon^2)$ without increasing the complexity.


1 Introduction 

Regression problems are ubiquitous, and the fast computation of their solutions is of interest in
many large-scale data applications. A parameterized family of regression problems that is of
particular interest is the overconstrained $\ell_p$ regression problem: given a matrix $A \in \mathbb{R}^{n \times d}$, with $n > d$,
a vector $b \in \mathbb{R}^n$, a norm $\|\cdot\|_p$ parameterized by $p \in [1, \infty]$, and an error parameter $\epsilon > 0$, find a
$(1 + \epsilon)$-approximate solution $\hat{x} \in \mathbb{R}^d$ to:

$$f^* = \min_{x \in \mathbb{R}^d} \|Ax - b\|_p, \qquad (1)$$

i.e., find a vector $\hat{x}$ such that $\|A\hat{x} - b\|_p \le (1 + \epsilon) f^*$, where the $\ell_p$ norm of a vector $x$ is
$\|x\|_p = (\sum_i |x_i|^p)^{1/p}$, defined to be $\max_i |x_i|$ for $p = \infty$. Special cases include the $\ell_2$ regression problem,
also known as the Least Squares Approximation problem, and the $\ell_1$ regression problem, also known as
the Least Absolute Deviations or Least Absolute Errors problem. The latter is of particular interest
as a robust estimation or robust regression technique, in that it is less sensitive to the presence of
outliers than the former. We are most interested in this paper in the $\ell_1$ regression problem due to
its robustness properties, but our methods hold for general $p \in [1, 2]$, and thus we formulate our
results in $\ell_p$.
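The objective in (1) and the $(1 + \epsilon)$-approximation criterion can be made concrete with a minimal sketch. The helper names `lp_norm`, `residual_norm`, and `is_approx_solution` are illustrative and not from the paper; matrices are assumed to be dense lists of rows, and the optimal value `f_star` is assumed to be known from elsewhere.

```python
import math


def lp_norm(v, p):
    """Return the l_p norm of a sequence of numbers; p may be math.inf."""
    if p == math.inf:
        return max(abs(x) for x in v)
    return sum(abs(x) ** p for x in v) ** (1.0 / p)


def residual_norm(A, b, x, p):
    """Return ||A x - b||_p for a dense matrix A given as a list of rows."""
    residual = [
        sum(a_ij * x_j for a_ij, x_j in zip(row, x)) - b_i
        for row, b_i in zip(A, b)
    ]
    return lp_norm(residual, p)


def is_approx_solution(A, b, x, p, f_star, eps):
    """Check the criterion ||A x - b||_p <= (1 + eps) * f_star."""
    return residual_norm(A, b, x, p) <= (1.0 + eps) * f_star
```

For instance, `lp_norm([3, -4], 1)` sums the absolute values while `lp_norm([3, -4], math.inf)` keeps only the largest one, which is why a single large residual (an outlier) influences the $\ell_1$ objective less than the $\ell_2$ or $\ell_\infty$ objectives.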

It is well-known that for $p \ge 1$, the overconstrained $\ell_p$ regression problem is a convex optimization
problem; for $p = 1$ and $p = \infty$, it is an instance of linear programming; and for $p = 2$, it can be
solved with eigenvector-based methods such as with the QR decomposition or the Singular Value
Decomposition of $A$. In spite of their low-degree polynomial-time solvability, $\ell_p$ regression problems
have been the focus in recent years of a wide range of random sampling and random projection
algorithms, largely due to a desire to develop improved algorithms for large-scale data applications
[3, 26, 10]. For example, Clarkson [8] uses subgradient and sampling methods to compute an
approximate solution to the overconstrained $\ell_1$ regression problem in roughly $\mathcal{O}(nd^5 \log n)$ time; and
Dasgupta et al. [12] use well-conditioned bases and subspace-preserving sampling algorithms to solve
general $\ell_p$ regression problems, for $p \in [1, \infty)$, in roughly $\mathcal{O}(nd^5 \log n)$ time. A similar
subspace-preserving sampling algorithm was developed by Drineas, Mahoney, and Muthukrishnan [16] to
compute an approximate solution to the $\ell_2$ regression problem. The algorithm of [16] relies on
the estimation of the $\ell_2$ leverage scores$^1$ of $A$ to be used as an importance sampling distribution,
but when combined with the results of Sarlós [27] and Drineas et al. [17] (that quickly preprocess
$A$ to uniformize those scores) or Drineas et al. [15] (that quickly computes approximations to
those scores), this leads to a random projection or random sampling (respectively) algorithm for
the $\ell_2$ regression problem that runs in roughly $\mathcal{O}(nd \log d)$ time [17, 20]. More recently, Sohler and
Woodruff [28] introduced the Cauchy Transform to obtain improved $\ell_1$ embeddings, thereby leading
to an algorithm for the $\ell_1$ regression problem that runs in $\mathcal{O}(nd^{1.376+})$ time; and Clarkson et al. [10]
use the Fast Cauchy Transform and ellipsoidal rounding methods to compute an approximation to
the solution of general $\ell_p$ regression problems in roughly $\mathcal{O}(nd \log n)$ time.

These algorithms, and in particular the algorithms for $p = 2$, form the basis for much of the
large body of recent work in randomized algorithms for low-rank matrix approximation, and thus
optimizing their properties can have immediate practical benefits. See, e.g., the recent monograph of
Mahoney [20] and references therein for details. Although some of these algorithms are near-optimal
for dense inputs, they all require $\Omega(nd \log d)$ time, which can be large if the input matrix is very
sparse. Thus, it was a significant result when Clarkson and Woodruff [11] developed an algorithm
for the $\ell_2$ regression problem (as well as the related problems of low-rank matrix approximation
and $\ell_2$ leverage score approximation) that runs in input-sparsity time, i.e., in $\mathcal{O}(\mathrm{nnz}(A) + \mathrm{poly}(d/\epsilon))$
time, where $\mathrm{nnz}(A)$ is the number of non-zero elements in $A$ and $\epsilon$ is an error parameter. This
result depends on the construction of a sparse embedding matrix $\Pi$ for $\ell_2$. By this, we mean the
following: for an $n \times d$ matrix $A$, an $s \times n$ matrix $\Pi$ such that

$$(1 - \epsilon) \|Ax\|_2 \le \|\Pi Ax\|_2 \le (1 + \epsilon) \|Ax\|_2,$$

for all $x \in \mathbb{R}^d$. That is, $\Pi$ embeds the column space of $A$ into $\mathbb{R}^s$, while approximately preserving
the $\ell_2$ norms of all vectors in that subspace. Clarkson and Woodruff achieve their improved results
for $\ell_2$-based problems by showing how to construct such a $\Pi$ with $s = \mathrm{poly}(d/\epsilon)$ and showing
that it can be applied to an arbitrary $A$ in $\mathcal{O}(\mathrm{nnz}(A))$ time [11]. (In particular, this embedding
result improves the result of Meng, Saunders, and Mahoney [23], who in their development of
the parallel least-squares solver LSRN use a result from Davidson and Szarek [14] to construct a
constant-distortion embedding for $\ell_2$ that runs in $\mathcal{O}(\mathrm{nnz}(A) \cdot d)$ time.) Interestingly, the analysis of
Clarkson and Woodruff coupled ideas from the data streaming literature with the structural fact
that there cannot be too many high-leverage constraints/rows in $A$. In particular, they showed
that the high-leverage parts of the subspace may be viewed as heavy-hitters that are "perfectly
hashed," and thus contribute no distortion, and that the distortion of the rest of the subspace as
well as the "cross terms" may be bounded with a result of Dasgupta, Kumar, and Sarlós [13].

$^1$ Recall that for an $n \times d$ matrix $A$, with $n \gg d$, the $\ell_2$ leverage scores of the rows of $A$ are equal to the diagonal
elements of the projection matrix onto the span of $A$. That is, if $A = QR$ is a QR decomposition of $A$, or if $A = Q \Sigma V^T$
is the thin SVD of $A$, then the leverage scores equal the Euclidean norms squared of the rows of the $n \times d$ matrix $Q$,
and thus they can be computed exactly in $\mathcal{O}(nd^2)$ time. See [20, 15] for details; and note that they can be generalized
to $\ell_1$ and other $\ell_p$ norms [10] as well as to arbitrary $n \times d$ matrices, with both $n$ and $d$ large, if one specifies a low-rank
parameter [21, 15].

In this paper, we provide improved low-distortion subspace embeddings for $\ell_p$, for all $p \in [1, 2]$,
in input-sparsity time; and we show that, by coupling with recent work on fast subspace-preserving
sampling from [10], these embeddings can be used to provide $(1 + \epsilon)$-approximate solutions to $\ell_p$
regression problems, for $p \in [1, 2]$, in nearly input-sparsity time. In more detail, our main results
are the following.

• For $\ell_2$, we obtain an improved result for the input-sparsity time $(1 \pm \epsilon)$-distortion embedding
of [11]. In particular, for the same embedding procedure, we obtain improved bounds for the
embedding dimension with a much simpler analysis than [11]. See Theorem 1 of Section 3 for a
precise statement of this result. Our analysis is direct and does not rely on splitting the
high-dimensional space into a set of heavy-hitters consisting of the high-leverage components and
the complement of that heavy-hitting set. In addition, since our result directly improves the
$\ell_2$ embedding result of Clarkson and Woodruff [11], it immediately leads to improvements for
the $\ell_2$ regression, low-rank matrix approximation, and $\ell_2$ leverage score estimation problems
that they consider.

• For $\ell_1$, we obtain a low-distortion sparse embedding matrix $\Pi$ such that $\Pi A$ can be computed
in input-sparsity time. That is, we construct an embedding matrix $\Pi \in \mathbb{R}^{\mathcal{O}(\mathrm{poly}(d)) \times n}$ such that,
for all $x \in \mathbb{R}^d$,

$$1/\mathcal{O}(\mathrm{poly}(d)) \cdot \|Ax\|_1 \le \|\Pi Ax\|_1 \le \mathcal{O}(\mathrm{poly}(d)) \cdot \|Ax\|_1,$$

with a constant probability, and $\Pi A$ can be computed in $\mathcal{O}(\mathrm{nnz}(A))$ time. See Theorem 2
of Section 4 for a precise statement of this result. Here, our proof involves splitting the set
$Y = \{Ux \mid \|x\|_\infty = 1, x \in \mathbb{R}^d\}$, where $U$ is an $\ell_1$ well-conditioned basis for the span of
$A$, into two parts, informally a subset where coordinates of high $\ell_1$ leverage dominate $\|y\|_1$
and the complement of that subset. This $\ell_1$ result leads to immediate improvements in $\ell_1$-based
problems. For example, by taking advantage of the fast version of subspace-preserving
sampling from [10], we can construct and apply a $(1 \pm \epsilon)$-distortion sparse embedding matrix
for $\ell_1$ in $\mathcal{O}(\mathrm{nnz}(A) \cdot \log n + \mathrm{poly}(d/\epsilon))$ time. In addition, we can use it to compute a
$(1 + \epsilon)$-approximation to the $\ell_1$ regression problem in $\mathcal{O}(\mathrm{nnz}(A) \cdot \log n + \mathrm{poly}(d/\epsilon))$ time, which in
turn leads to immediate improvements in $\ell_1$-based matrix approximation objectives, e.g., for
the $\ell_1$ subspace approximation problem [6, 28, 10].

• For $\ell_p$, for all $p \in (1, 2)$, we obtain a low-distortion sparse embedding matrix $\Pi$ such that
$\Pi A$ can be computed in input-sparsity time. That is, we construct an embedding matrix
$\Pi \in \mathbb{R}^{\mathcal{O}(\mathrm{poly}(d)) \times n}$ such that, for all $x \in \mathbb{R}^d$,

$$1/\mathcal{O}(\mathrm{poly}(d)) \cdot \|Ax\|_p \le \|\Pi Ax\|_p \le \mathcal{O}(\mathrm{poly}(d)) \cdot \|Ax\|_p,$$

with a constant probability, and $\Pi A$ can be computed in $\mathcal{O}(\mathrm{nnz}(A))$ time. See Theorem 4
of Section 5 for a precise statement of this result. Here, our proof generalizes the $\ell_1$ result,
but we need to prove upper and lower tail bound inequalities for sampling from general
$p$-stable distributions that are of independent interest. Although these distributions do not
have closed forms for $p \in (1, 2)$ in general, we prove that there exists an order among the
Cauchy distribution, a $p$-stable distribution with $p \in (1, 2)$, and the Gaussian distribution
such that for all $p \in (1, 2)$ we can use the upper bound from the Cauchy distribution and
the lower bound from the Gaussian distribution. As with our $\ell_1$ result, this $\ell_p$ result has
several extensions: in $\mathcal{O}(\mathrm{nnz}(A) \cdot \log n + \mathrm{poly}(d/\epsilon))$ time, we can construct and apply a
$(1 \pm \epsilon)$-distortion sparse embedding matrix for $\ell_p$; in $\mathcal{O}(\mathrm{nnz}(A) \cdot \log n + \mathrm{poly}(d/\epsilon))$ time, we
can compute a $(1 + \epsilon)$-approximation to the $\ell_p$ regression problem; and in $\mathcal{O}(\mathrm{nnz}(A) \cdot d \log d)$
time, we can construct and apply a near-optimal (in terms of embedding dimension and
distortion factor) embedding matrix.

The $(1 \pm \epsilon)$-distortion subspace embedding (for $\ell_p$, $p \in [1, 2)$, that we construct from the
input-sparsity time embedding and the fast subspace-preserving sampling) has embedding dimension $s =
\mathcal{O}(\mathrm{poly}(d) \log(1/\epsilon)/\epsilon^2)$, where the somewhat large $\mathrm{poly}(d)$ term directly multiplies the $\log(1/\epsilon)/\epsilon^2$
term. We can also improve this, showing that it is possible, without increasing the overall complexity,
to decouple the large $\mathrm{poly}(d)$ and $\log(1/\epsilon)/\epsilon^2$ via another round of sampling and conditioning,
thereby obtaining an embedding dimension that is a small $\mathrm{poly}(d)$ times $\log(1/\epsilon)/\epsilon^2$. See Theorem 7
of Section 6 for a precise statement of this result.



2 Background 

We use $\|\cdot\|_p$ to denote the $\ell_p$ norm of a vector, $\|\cdot\|_2$ the spectral norm of a matrix, and $|\cdot|_p$
the element-wise $\ell_p$ norm of a matrix. Given $A \in \mathbb{R}^{n \times d}$ with full column rank and $p \in [1, 2]$, we
use $\mathcal{A}_p$ to denote the $\ell_p$ subspace spanned by $A$'s columns. In this paper, we are interested in fast
embedding of $\mathcal{A}_p$ into a $d$-dimensional subspace of $(\mathbb{R}^{\mathcal{O}(\mathrm{poly}(d))}, \|\cdot\|_p)$, with distortion either $\mathrm{poly}(d)$ or
$(1 \pm \epsilon)$, for some $\epsilon > 0$, as well as applications of this embedding to problems such as $\ell_p$ regression.
We assume that $n \gg \mathrm{poly}(d) \ge d \gg \log n$. To state our results, we assume that we are capable of
computing a $(1 + \epsilon)$-approximate solution to an $\ell_p$ regression problem of size $n' \times d$ for some $\epsilon > 0$,
as long as $n'$ is independent of $n$. Let us denote the running time needed to solve this smaller
problem by $\mathcal{T}_p(\epsilon; n', d)$. In theory, we have $\mathcal{T}_2(\epsilon; n', d) = \mathcal{O}(n' d \log(d/\epsilon) + d^3)$ (see Rokhlin and
Tygert [26] and Drineas et al. [17]), and $\mathcal{T}_p(\epsilon; n', d) = \mathcal{O}((n' d^2 + \mathrm{poly}(d)) \log(n'/\epsilon))$, for general $p$
(see, e.g., Mitchell [21]).



Conditioning. The $\ell_p$ subspace embedding and $\ell_p$ regression problems are closely related to the
concept of conditioning. We state here two related notions of $\ell_p$-norm conditioning and then a
lemma that characterizes the relationship between them.

Definition 1 ($\ell_p$-norm Conditioning (from [10])). Given an $n \times d$ matrix $A$ and $p \in [1, \infty]$, let

$$\sigma_p^{\max}(A) = \max_{\|x\|_2 \le 1} \|Ax\|_p \quad \text{and} \quad \sigma_p^{\min}(A) = \min_{\|x\|_2 \ge 1} \|Ax\|_p.$$

Then, we denote by $\kappa_p(A)$ the $\ell_p$-norm condition number of $A$, defined to be:

$$\kappa_p(A) = \sigma_p^{\max}(A) / \sigma_p^{\min}(A).$$

For simplicity, we will use $\kappa_p$, $\sigma_p^{\min}$, and $\sigma_p^{\max}$ when the underlying matrix is clear.

Definition 2 ($(\alpha, \beta, p)$-conditioning (from [12])). Given an $n \times d$ matrix $A$ and $p \in [1, \infty]$, let $q$
be the dual norm of $p$. Then $A$ is $(\alpha, \beta, p)$-conditioned if (1) $|A|_p \le \alpha$, and (2) for all $z \in \mathbb{R}^d$,
$\|z\|_q \le \beta \|Az\|_p$. Define $\bar{\kappa}_p(A)$ as the minimum value of $\alpha\beta$ such that $A$ is $(\alpha, \beta, p)$-conditioned.

Lemma 1 (Equivalence of $\kappa_p$ and $\bar{\kappa}_p$ (from [10])). Given an $n \times d$ matrix $A$ and $p \in [1, \infty]$, we
always have

$$d^{-|1/2 - 1/p|} \, \kappa_p(A) \le \bar{\kappa}_p(A) \le d^{\max\{1/2, 1/p\}} \, \kappa_p(A).$$

Remark. Given the equivalence established by Lemma 1, we will say that $A$ is well-conditioned
in the $\ell_p$ norm if $\kappa_p(A)$ or $\bar{\kappa}_p(A) = \mathcal{O}(\mathrm{poly}(d))$, independent of $n$.

Although for an arbitrary matrix $A \in \mathbb{R}^{n \times d}$, the condition numbers $\kappa_p(A)$ and $\bar{\kappa}_p(A)$ can be
arbitrarily large, we can often find a matrix $R \in \mathbb{R}^{d \times d}$ such that $AR^{-1}$ is well-conditioned. This
procedure is called conditioning, and there exist two approaches for conditioning: via low-distortion
$\ell_p$ subspace embedding and via ellipsoidal rounding.

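Definition 3 (Low-distortion $\ell_p$ Subspace Embedding). Given an $n \times d$ matrix $A$ and $p \in [1, \infty]$,
$\Pi \in \mathbb{R}^{s \times n}$ is a low-distortion embedding of $\mathcal{A}_p$ if $s = \mathcal{O}(\mathrm{poly}(d))$ and

$$1/\mathcal{O}(\mathrm{poly}(d)) \cdot \|Ax\|_p \le \|\Pi Ax\|_p \le \mathcal{O}(\mathrm{poly}(d)) \cdot \|Ax\|_p, \quad \forall x \in \mathbb{R}^d.$$

Remark. Given a low-distortion embedding matrix $\Pi$ of $\mathcal{A}_p$, let $R$ be the "R" matrix from the
QR decomposition of $\Pi A$. Then, the matrix $AR^{-1}$ is well-conditioned in the $\ell_p$ norm. To see this,
note that we have

$$\|AR^{-1}x\|_p \le \mathcal{O}(\mathrm{poly}(d)) \cdot \|\Pi AR^{-1}x\|_p \le \mathcal{O}(\mathrm{poly}(d)) \cdot \|\Pi AR^{-1}x\|_2 = \mathcal{O}(\mathrm{poly}(d)) \cdot \|x\|_2, \quad \forall x \in \mathbb{R}^d,$$

where the first inequality is due to low distortion, the second inequality is due to $s = \mathcal{O}(\mathrm{poly}(d))$
(which bounds the gap between the $\ell_p$ and $\ell_2$ norms on $\mathbb{R}^s$), and the final equality holds because
$\Pi AR^{-1}$ is the "Q" factor of $\Pi A$ and thus has orthonormal columns.
By similar arguments, we can show that $\|AR^{-1}x\|_p \ge 1/\mathcal{O}(\mathrm{poly}(d)) \cdot \|x\|_2$, $\forall x \in \mathbb{R}^d$. Hence, by
combining these results, the matrix $AR^{-1}$ is well-conditioned in the $\ell_p$ norm.

For a discussion of ellipsoidal rounding, we refer readers to Clarkson et al. [10]. In this paper,
we simply cite the following lemma, which is based on ellipsoidal rounding.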

Lemma 2 (Fast $\mathcal{O}(d)$-conditioning (from [10])). Given an $n \times d$ matrix $A$ and $p \in [1, \infty]$, it takes
at most $\mathcal{O}(nd^3 \log n)$ time to find a matrix $R \in \mathbb{R}^{d \times d}$ such that $\kappa_p(AR^{-1}) \le 2d$.

Subspace-preserving sampling and $\ell_p$ regression. Given $R \in \mathbb{R}^{d \times d}$ such that $AR^{-1}$ is
well-conditioned in the $\ell_p$ norm, we can construct a $(1 \pm \epsilon)$-distortion embedding, specifically a
subspace-preserving sampling, of $\mathcal{A}_p$ in $\mathcal{O}(\mathrm{nnz}(A) \cdot \log n)$ additional time and with a constant probability. This
result from Clarkson et al. [10, Theorem 5.4] improves the subspace-preserving sampling algorithm
proposed by Dasgupta et al. [12] by estimating the row norms of $AR^{-1}$ (instead of computing them
exactly) to define importance sampling probabilities.

Lemma 3 (Fast Subspace-preserving Sampling (from [10])). Given a matrix $A \in \mathbb{R}^{n \times d}$, $p \in [1, \infty)$,
$\epsilon > 0$, and a matrix $R \in \mathbb{R}^{d \times d}$ such that $AR^{-1}$ is well-conditioned, it takes $\mathcal{O}(\mathrm{nnz}(A) \cdot \log n)$
time to compute a sampling matrix $S \in \mathbb{R}^{s \times n}$ (with only one nonzero element per row) with $s =
\mathcal{O}(\bar{\kappa}_p^p(AR^{-1}) \, d^{|p/2 - 1| + 1} \log(1/\epsilon)/\epsilon^2)$ such that with a constant probability,

$$(1 - \epsilon) \|Ax\|_p \le \|SAx\|_p \le (1 + \epsilon) \|Ax\|_p, \quad \forall x \in \mathbb{R}^d.$$
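The idea behind subspace-preserving sampling can be illustrated with a simplified sketch. It follows the exact-row-norm scheme attributed above to Dasgupta et al. [12] rather than the faster estimated-norm variant of Lemma 3: each row of $A$ is kept independently with probability proportional to the $p$-th power of the $\ell_p$ norm of the matching row of $U = AR^{-1}$, and kept rows are rescaled so that $\|SAx\|_p^p$ equals $\|Ax\|_p^p$ in expectation. The function names and the target size parameter `s` are illustrative assumptions, and $U$ is assumed to be nonzero and supplied as a dense list of rows.

```python
import random


def sampling_probabilities(U, p, s):
    """Return q_i = min(1, s * ||U_i||_p^p / |U|_p^p) for each row U_i of U."""
    weights = [sum(abs(u) ** p for u in row) for row in U]
    total = sum(weights)  # element-wise l_p norm of U, raised to the p-th power
    return [min(1.0, s * w / total) for w in weights]


def sample_rows(A, probs, p, rng):
    """Keep row i of A with probability probs[i], rescaled by probs[i]**(-1/p).

    Returns (row index, rescaled row) pairs, i.e. the nonzero rows of S A.
    """
    kept = []
    for i, (row, q) in enumerate(zip(A, probs)):
        if q > 0 and rng.random() < q:
            scale = q ** (-1.0 / p)
            kept.append((i, [scale * a for a in row]))
    return kept
```

Rows that carry a large share of $|U|_p^p$ are kept with probability close to 1, so the sample cannot miss the few rows that dominate some direction of the subspace; this is the role played by the well-conditioned basis.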

Given such a subspace-preserving sampling algorithm, Clarkson et al. [10, Theorem 5.4] show that
it is straightforward to compute a $\frac{1+\epsilon}{1-\epsilon}$-approximate solution to an $\ell_p$ regression problem.

Lemma 4 ($\ell_p$ Regression via Sampling (from [10])). Given an $\ell_p$ regression problem specified by
$A \in \mathbb{R}^{n \times d}$, $b \in \mathbb{R}^n$, and $p \in [1, \infty)$, let $S$ be a $(1 \pm \epsilon)$-distortion embedding matrix of the subspace
spanned by $A$'s columns and $b$ from Lemma 3, and let $\hat{x}$ be an optimal solution to the subsampled
problem $\min_{x \in \mathbb{R}^d} \|SAx - Sb\|_p$. Then $\hat{x}$ is a $\frac{1+\epsilon}{1-\epsilon}$-approximate solution to the original problem.

Remark. Collecting these results, we see that a low-distortion $\ell_p$ subspace embedding is a
fundamental building block (and very likely a bottleneck) for $(1 \pm \epsilon)$-distortion $\ell_p$ subspace embeddings,
as well as for a $(1 + \epsilon)$-approximation to an $\ell_p$ regression problem. This motivates our work and its
emphasis on finding low-distortion subspace embeddings more efficiently.



Stable distributions. The properties of $p$-stable distributions are essential for constructing
input-sparsity time low-distortion $\ell_p$ subspace embeddings.

Definition 4 ($p$-stable Distribution). A distribution $\mathcal{D}$ over $\mathbb{R}$ is called $p$-stable, if for any $m$ real
numbers $a_1, \ldots, a_m$, we have

$$\sum_{i=1}^m a_i X_i \simeq \left( \sum_{i=1}^m |a_i|^p \right)^{1/p} X,$$

where $X_i \overset{\mathrm{iid}}{\sim} \mathcal{D}$ and $X \sim \mathcal{D}$. By "$X \simeq Y$", we mean $X$ and $Y$ have the same distribution.

By a result due to Lévy [19], it is known that $p$-stable distributions exist for $p \in (0, 2]$; and from
Chambers et al. [7], it is known that $p$-stable random variables can be generated efficiently, thus
allowing their practical use. Let us use $\mathcal{D}_p$ to denote the "standard" $p$-stable distribution, for
$p \in [1, 2]$, specified by its characteristic function $\psi(t) = e^{-|t|^p}$. It is known that $\mathcal{D}_1$ is the standard
Cauchy distribution, and that $\mathcal{D}_2$ is the Gaussian distribution with mean 0 and variance 2.
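A minimal sketch of how samples from $\mathcal{D}_p$ can be generated, assuming the standard Chambers-Mallows-Stuck transform for symmetric stable laws (the function names are illustrative, not from the paper). A uniform angle $\theta \in (-\pi/2, \pi/2)$ and an independent unit-rate exponential variable $w$ are mapped to a variable with characteristic function $e^{-|t|^p}$; the transform reduces to $\tan\theta$ (standard Cauchy) at $p = 1$ and to $2 \sin\theta \sqrt{w}$ (Gaussian with variance 2) at $p = 2$.

```python
import math
import random


def stable_from_uniform_exp(theta, w, p):
    """Map theta in (-pi/2, pi/2) and w > 0 to a symmetric p-stable value."""
    if p == 1:
        return math.tan(theta)
    return (
        math.sin(p * theta) / math.cos(theta) ** (1.0 / p)
        * (math.cos((1.0 - p) * theta) / w) ** ((1.0 - p) / p)
    )


def sample_p_stable(p, rng):
    """Draw one sample from the standard p-stable distribution, p in [1, 2]."""
    theta = rng.uniform(-math.pi / 2, math.pi / 2)
    w = rng.expovariate(1.0)
    while w == 0.0:  # guard against a zero draw, which would divide by zero
        w = rng.expovariate(1.0)
    return stable_from_uniform_exp(theta, w, p)
```

Separating the deterministic transform from the random draws makes the two closed-form endpoints easy to check directly.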



Tail inequalities. We note two inequalities from Clarkson et al. [10] regarding the tails of the
Cauchy distribution.

Lemma 5 (Cauchy Upper Tail Inequality). For $i = 1, \ldots, m$, let $C_i$ be $m$ (not necessarily
independent) standard Cauchy variables, and $\gamma_i > 0$ with $\gamma = \sum_i \gamma_i$. Let $X = \sum_i \gamma_i |C_i|$. For any
$t > 1$,

$$\Pr[X > \gamma t] \le \frac{1}{\pi t} \left( \frac{\log(1 + (2mt)^2)}{1 - 1/(\pi t)} + 1 \right).$$

For simplicity, we assume that $m \ge 3$ and $t \ge 1$, and then we have $\Pr[X > t\gamma] \le 2 \log(mt)/t$.

Lemma 6 (Cauchy Lower Tail Inequality). For $i = 1, \ldots, m$, let $C_i$ be independent standard
Cauchy random variables, and $\gamma_i > 0$ with $\gamma = \sum_i \gamma_i$. Let $X = \sum_i \gamma_i |C_i|$. Then, for any $t > 0$,

$$\log \Pr[X \le (1 - t)\gamma] \le \frac{-\gamma t^2}{3 \max_i \gamma_i}.$$



We also note the following result about Gaussian variables. This is a direct consequence of Maurer's
inequality [22], and we will use it to derive lower tail inequalities for $p$-stable distributions.

Lemma 7 (Gaussian Lower Tail Inequality). For $i = 1, \ldots, m$, let $G_i$ be independent standard
Gaussian random variables, and $\gamma_i > 0$ with $\gamma = \sum_i \gamma_i$. Let $X = \sum_i \gamma_i |G_i|^2$. Then, for any $t > 0$,

$$\log \Pr[X \le (1 - t)\gamma] \le \frac{-\gamma t^2}{6 \max_i \gamma_i}.$$

3 Main Results for $\ell_2$ Embedding



Here is our main result for input-sparsity time low-distortion subspace embeddings for $\ell_2$.

Theorem 1 ($(1 \pm \epsilon)$-distortion Embedding for $\ell_2$). Given a matrix $A \in \mathbb{R}^{n \times d}$ and $\epsilon \in (0, 1)$, let
$\Pi = SD$, where $S \in \mathbb{R}^{s \times n}$ has each column chosen independently and uniformly from the $s$ standard
basis vectors of $\mathbb{R}^s$ and $D \in \mathbb{R}^{n \times n}$ is a diagonal matrix with diagonal entries chosen independently
and uniformly from $\pm 1$. Let $s = (d^4 + 3d^3)/\epsilon^2$. Then with probability at least 0.5,

$$(1 - \epsilon) \|Ax\|_2 \le \|\Pi Ax\|_2 \le (1 + \epsilon) \|Ax\|_2, \quad \forall x \in \mathbb{R}^d.$$

In addition, $\Pi A$ can be computed in $\mathcal{O}(\mathrm{nnz}(A))$ time.
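The running-time claim can be seen from a minimal sketch of applying $\Pi = SD$ to a sparse matrix. The function name `sparse_embed` and the coordinate-triple input format are assumptions made for illustration. Since $S$ sends row $i$ of $A$ to a single row $h(i)$ of the output and $D$ only flips its sign, each nonzero of $A$ is touched exactly once; drawing the hash values and signs adds $\mathcal{O}(n)$ work.

```python
import random


def sparse_embed(entries, n, s, rng):
    """Compute Pi A for Pi = S D, with A given as (row, col, value) triples.

    Returns the product as a dict {(row, col): value} together with the
    hash values h (defining S) and the signs (the diagonal of D).
    """
    # S: column i of S is the standard basis vector e_{h[i]} of R^s.
    h = [rng.randrange(s) for _ in range(n)]
    # D: independent uniform +/-1 diagonal entries.
    sign = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    product = {}
    for i, j, value in entries:  # one pass over the nonzeros of A
        key = (h[i], j)
        product[key] = product.get(key, 0.0) + sign[i] * value
    return product, h, sign
```

Replacing the sign draws with standard Cauchy or $p$-stable draws gives the diagonal matrices used in the $\ell_1$ and $\ell_p$ constructions of Theorems 2 and 4, with the same single pass over the nonzeros.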

The construction of $\Pi$ in this theorem is the same as the construction in Clarkson and Woodruff [11].
For them, $s = \mathcal{O}((d/\epsilon)^4 \log^2(d/\epsilon))$ in order to achieve $(1 \pm \epsilon)$ distortion with a constant probability.
Theorem 1 shows that it actually suffices to set $s = (d^4 + 3d^3)/\epsilon^2$, thus nailing down the constant
factors in the big-$\mathcal{O}$, removing the $\log^2(d/\epsilon)$ factor, and improving the dependence on $\epsilon$ from $1/\epsilon^4$
to $1/\epsilon^2$. Surprisingly, the proof is rather simple: we apply Chebyshev's inequality to the elements
of $X = U^T \Pi^T \Pi U$, where $U$ is an orthonormal basis for $\mathcal{A}_2$, to obtain, via a union bound, an
element-wise bound on $|X - I|$, and we then use Gershgorin's circle theorem to bound the eigenvalues
of $X$ such that $\|X - I\|_2 \le \epsilon$. See Appendix A.1 for a complete proof.
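The Gershgorin step of this argument can be sketched in a few lines, assuming $X$ is a symmetric $d \times d$ matrix given as a list of rows (the function name is illustrative). Every eigenvalue of $X - I$ lies in a disc centred at a diagonal entry of $X - I$ with radius equal to the absolute off-diagonal row sum, so the largest absolute row sum of $X - I$ bounds $\|X - I\|_2$. In particular, if every entry of $X - I$ is at most $\epsilon/d$ in absolute value, the bound is at most $\epsilon$.

```python
def gershgorin_deviation(X):
    """Upper bound on ||X - I||_2 for a symmetric matrix X (list of rows).

    Returns max_i sum_j |X_ij - delta_ij|, which by Gershgorin's circle
    theorem bounds the absolute value of every eigenvalue of X - I.
    """
    d = len(X)
    return max(
        sum(abs(X[i][j] - (1.0 if i == j else 0.0)) for j in range(d))
        for i in range(d)
    )
```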



Remark. The $\mathcal{O}(\mathrm{nnz}(A))$ running time is indeed optimal, up to constant factors, for general inputs.
Consider the case when $A$ has an important row $a_j$ such that $A$ becomes rank-deficient without it.
Thus, we have to observe $a_j$ in order to compute a low-distortion embedding. However, without
any prior knowledge, we have to scan at least a constant portion of the input to guarantee that
$a_j$ is observed with a constant probability, which takes $\mathcal{O}(\mathrm{nnz}(A))$ time. Note that this optimality
result applies to general $p$.

The results of Theorem 1 propagate to related applications, e.g., to the $\ell_2$ regression problem,
the low-rank matrix approximation problem and the problem of computing approximations to the
$\ell_2$ leverage scores. Since it underlies the other applications, only the $\ell_2$ regression improvement is
stated here explicitly; its proof essentially combines our Theorem 1 with Theorem 19 of [11].

Corollary 1 (Fast $\ell_2$ Regression). With a constant probability, a $(1 + \epsilon)$-approximate solution to
an $\ell_2$ regression problem can be computed in $\mathcal{O}(\mathrm{nnz}(A) + \mathcal{T}_2(\epsilon; d^4/\epsilon^2, d))$ time.

Remark. Although our simpler direct proof leads to a better result for $\ell_2$ subspace embedding, the
technique used in the proof of Clarkson and Woodruff [11], which splits coordinates into "heavy"
and "light" sets based on the leverage scores, highlights an important structural property of $\ell_2$
subspace: that only a small subset of coordinates can have large $\ell_2$ leverage scores. (We note that
the technique of splitting coordinates is also used by Ailon and Liberty [1] to get an unrestricted
fast Johnson-Lindenstrauss transform; and that the difficulty in finding and approximating the
large-leverage directions was, until recently [20, 15], responsible for difficulties in obtaining fast
relative-error random sampling algorithms for $\ell_2$ regression and low-rank matrix approximation.)
An analogous structural fact holds for $\ell_1$ and other $\ell_p$ spaces. Using this property, we can construct
novel input-sparsity time $\ell_p$ subspace embeddings for general $p \in [1, 2)$, as we discuss in the next
two sections.



4 Main Results for $\ell_1$ Embedding

Here is our main result for input-sparsity time low-distortion subspace embeddings for $\ell_1$.

Theorem 2 (Low-distortion Embedding for $\ell_1$). Given $A \in \mathbb{R}^{n \times d}$ with full column rank, let $\Pi =
SC \in \mathbb{R}^{s \times n}$, where $S \in \mathbb{R}^{s \times n}$ has each column chosen independently and uniformly from the $s$
standard basis vectors of $\mathbb{R}^s$, and where $C \in \mathbb{R}^{n \times n}$ is a diagonal matrix with diagonals chosen
independently from the standard Cauchy distribution. Set $s = \omega d^5 \log^5 d$ with $\omega$ sufficiently large.
Then with a constant probability, we have

$$1/\mathcal{O}(d^2 \log^2 d) \cdot \|Ax\|_1 \le \|\Pi Ax\|_1 \le \mathcal{O}(d \log d) \cdot \|Ax\|_1, \quad \forall x \in \mathbb{R}^d.$$

In addition, $\Pi A$ can be computed in $\mathcal{O}(\mathrm{nnz}(A))$ time.

The construction of the $\ell_1$ subspace embedding matrix differs from its $\ell_2$ norm counterpart
only in the diagonal elements of $D$ (or $C$): whereas we use $\pm 1$ for the $\ell_2$ norm, we use Cauchy
variables for the $\ell_1$ norm. The proof of Theorem 2 uses the technique of splitting coordinates, the
fact that the Cauchy distribution is 1-stable, and the upper and lower tail inequalities regarding
the Cauchy distribution from Lemmas 5 and 6. See Appendix A.2 for a complete proof.



Remark. As mentioned above, the $\mathcal{O}(\mathrm{nnz}(A))$ running time is optimal. Whether the distortion
$\mathcal{O}(d^3 \log^3 d)$ is optimal is still an open question. However, for the same construction of $\Pi$, we
can provide a "bad" case that provides a lower bound. Choose $A = (I_d \;\; 0)^T$. Suppose that $s$
is sufficiently large such that with an overwhelming probability, the top $d$ rows of $A$ are perfectly
hashed, i.e., $\|\Pi Ax\|_1 = \sum_{k=1}^d |c_k| |x_k|$, $\forall x \in \mathbb{R}^d$, where $c_k$ is the $k$-th diagonal of $C$. Then, the
distortion of $\Pi$ is $\max_{k \le d} |c_k| / \min_{k \le d} |c_k| \approx \mathcal{O}(d^2)$. Therefore, at most an $\mathcal{O}(d \log^3 d)$ factor of the
distortion is due to artifacts in our analysis.
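The bad case can be explored numerically with a short sketch, under the remark's assumption that the top $d$ rows are perfectly hashed (the function names are illustrative). In that case $\|\Pi Ax\|_1 / \|x\|_1$ ranges between $\min_k |c_k|$ and $\max_k |c_k|$, both attained at standard basis vectors, so the distortion is their ratio. Standard Cauchy draws are generated by the inverse-CDF formula $\tan(\pi(u - 1/2))$ for uniform $u$.

```python
import math
import random


def standard_cauchy(rng):
    """Draw a standard Cauchy variable by inverting its CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))


def bad_case_distortion(c):
    """Distortion of x -> sum_k |c_k| |x_k| relative to ||x||_1.

    Assumes all c_k are nonzero, which holds with probability 1 for
    continuous draws such as Cauchy variables.
    """
    magnitudes = [abs(ck) for ck in c]
    return max(magnitudes) / min(magnitudes)


def simulate_bad_case(d, rng):
    """Distortion for one random draw of the d relevant diagonals of C."""
    return bad_case_distortion([standard_cauchy(rng) for _ in range(d)])
```

Repeating `simulate_bad_case(d, random.Random(seed))` over several seeds and several values of `d` gives an empirical feel for how quickly the ratio grows with `d`; because the Cauchy distribution is heavy-tailed, individual draws vary widely.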

Our input-sparsity time $\ell_1$ subspace embedding of Theorem 2 improves the $\mathcal{O}(\mathrm{nnz}(A) \cdot d \log d)$-time
embedding by Sohler and Woodruff [28] and the $\mathcal{O}(nd \log n)$-time embedding of Clarkson et
al. [10]. In addition, by combining Theorem 2 and Lemma 3, we can compute a $(1 \pm \epsilon)$-distortion
embedding in $\mathcal{O}(\mathrm{nnz}(A) \cdot \log n)$ time, i.e., in nearly input-sparsity time.

Theorem 3 ($(1 \pm \epsilon)$-distortion Embedding for $\ell_1$). Given $A \in \mathbb{R}^{n \times d}$, it takes $\mathcal{O}(\mathrm{nnz}(A) \cdot \log n)$ time
to compute a sampling matrix $S \in \mathbb{R}^{s \times n}$ with $s = \mathcal{O}(\mathrm{poly}(d) \log(1/\epsilon)/\epsilon^2)$ such that with a constant
probability, $S$ embeds $\mathcal{A}_1$ into $(\mathbb{R}^s, \|\cdot\|_1)$ with distortion $1 \pm \epsilon$.

Our improvements in Theorems 2 and 3 also propagate to related $\ell_1$-based applications,
including the $\ell_1$ regression and the $\ell_1$ subspace approximation problem considered in [28, 10]. As
before, only the regression improvement is stated here explicitly. For completeness, we present in
Algorithm 1 our algorithm for solving $\ell_1$ regression problems in nearly input-sparsity time. The
brief proof of Corollary 2, our main quality-of-approximation result for Algorithm 1, may be found
in Appendix A.3.



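Algorithm 1 Fast $\ell_1$ Regression Approximation in $\mathcal{O}(\mathrm{nnz}(A) \cdot \log n + \mathrm{poly}(d) \log(1/\epsilon)/\epsilon^2)$ Time
Input: $A \in \mathbb{R}^{n \times d}$ with full column rank, $b \in \mathbb{R}^n$, and $\epsilon \in (0, 1/2)$.
Output: A $(1 + \epsilon)$-approximate solution $\hat{x}$ to $\min_{x \in \mathbb{R}^d} \|Ax - b\|_1$, with a constant probability.
1: Let $\bar{A} = (A \;\; b)$ and denote by $\bar{\mathcal{A}}_1$ the $\ell_1$ subspace spanned by $A$'s columns and $b$.
2: Compute a low-distortion embedding $\Pi \in \mathbb{R}^{\mathcal{O}(\mathrm{poly}(d)) \times n}$ of $\bar{\mathcal{A}}_1$ (Theorem 2).
3: Compute $\bar{R} \in \mathbb{R}^{(d+1) \times (d+1)}$ from $\Pi \bar{A}$ such that $\bar{A} \bar{R}^{-1}$ is well-conditioned (QR or Lemma 2).
4: Compute a $(1 \pm \epsilon/4)$-distortion embedding $S \in \mathbb{R}^{\mathcal{O}(\mathrm{poly}(d) \log(1/\epsilon)/\epsilon^2) \times n}$ of $\bar{\mathcal{A}}_1$ (Lemma 3).
5: Compute a $(1 + \epsilon/4)$-approximate solution $\hat{x}$ to $\min_{x \in \mathbb{R}^d} \|SAx - Sb\|_1$.

Corollary 2 (Fast $\ell_1$ Regression). With a constant probability, Algorithm 1 computes a
$(1 + \epsilon)$-approximate solution to an $\ell_1$ regression problem in $\mathcal{O}(\mathrm{nnz}(A) \cdot \log n + \mathcal{T}_1(\epsilon; \mathrm{poly}(d) \log(1/\epsilon)/\epsilon^2, d))$
time.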

Remark. For readers familiar with the impossibility results for dimension reduction in $\ell_1$ [511181 15],
note that those results apply to arbitrary point sets of size n and are interested in embeddings that 
are "oblivious," in that they do not depend on the input data. In this paper, we only consider 
points in a subspace, and the subspace-preserving sampling procedure of [12] that we use is data- 
dependent. 



5 Main Results for $\ell_p$ Embedding

In this section, we use the properties of $p$-stable distributions to generalize the input-sparsity time
$\ell_1$ subspace embedding to $\ell_p$ norms, for $p \in (1, 2)$. Generally, $\mathcal{D}_p$ does not have an explicit PDF or CDF,
which makes theoretical analysis more difficult. Indeed, the main technical difficulty here is
that we are not aware of $\ell_p$ analogues of Lemmas 5 and 6 that would provide upper and lower
tail inequalities for $p$-stable distributions. (Indeed, even Lemmas 5 and 6 were established only
recently [10].)

Instead of analyzing $\mathcal{D}_p$ directly, for any $p \in (1, 2)$, we establish an order among the Cauchy
distribution, the $p$-stable distribution, and the Gaussian distribution, and then we derive upper and
lower tail inequalities for the $p$-stable distribution similar to the ones we used to prove Theorem 2.
We state these technical results here since they are of independent interest. We start with the
following lemma, which is proved in Appendix A.4 and which establishes this order.



Lemma 8. For any $p \in (1, 2)$, there exist constants $\alpha_p > 0$ and $\beta_p > 0$ such that

$$\alpha_p |C| \succeq |X_p|^p \succeq \beta_p |G|^2,$$

where $C$ is a standard Cauchy variable, $X_p \sim \mathcal{D}_p$, and $G$ is a standard Gaussian variable. By "$X \succeq Y$"
we mean $\Pr[X \ge t] \ge \Pr[Y \ge t]$, $\forall t \in \mathbb{R}$, i.e., $F_X(t) \le F_Y(t)$, $\forall t \in \mathbb{R}$, where $F(\cdot)$ is the
corresponding CDF.

Our numerical results suggest that the constants $\alpha_p$ and $\beta_p$ are not too far away from 1. See
Figure 1, which plots the CDFs of $|X_p/2|^p$ for $p = 1.0, 1.1, \ldots, 2.0$, based on which we conjecture
$|X_{p_1}/2|^{p_1} \succeq |X_{p_2}/2|^{p_2}$, for all $1 \le p_1 \le p_2 \le 2$. This implies that $2^{p-1} |C| \succeq |X_p|^p$ and $|X_p|^p \succeq
2^{p-2} |X_2|^2 = 2^{p-1} |G|^2$, which therefore provides a value for the constants $\alpha_p$ and $\beta_p$.

Lemma 8 suggests that we can use Lemma 5 (regarding Cauchy random variables) to derive
upper tail inequalities for general $p$-stable distributions and that we can use Lemma 7 (regarding
Gaussian variables) to derive lower tail inequalities for general $p$-stable distributions. The following
two lemmas establish these results; the proofs of these lemmas are provided in Appendix A.5 and
Appendix A.6, respectively.



Lemma 9 (Upper Tail Inequality for $p$-stable Distributions). Given $p \in (1, 2)$, for $i = 1, \ldots, m$,
let $X_i$ be $m$ (not necessarily independent) random variables sampled from $\mathcal{D}_p$, and $\gamma_i > 0$ with
$\gamma = \sum_i \gamma_i$. Let $X = \sum_i \gamma_i |X_i|^p$. Assume that $m \ge 3$. Then for any $t \ge 1$,

$$\Pr[X \ge t \alpha_p \gamma] \le \frac{2 \log(mt)}{t}.$$

Figure 1: The CDFs ($F(t)$) of $|X_p/2|^p$ for $p = 1.0$ (bottom, i.e., red or dark gray), $1.1, \ldots, 2.0$ (top,
i.e., yellow or light gray), where $X_p \sim \mathcal{D}_p$ and the scales of the axes are chosen to magnify the
upper (as $t \to \infty$) and lower (as $t \to 0$) tails. These empirical results suggest $|X_{p_1}/2|^{p_1} \succeq |X_{p_2}/2|^{p_2}$
for all $1 \le p_1 \le p_2 \le 2$.



Lemma 10 (Lower Tail Inequality for $p$-stable Distributions). For $i = 1, \ldots, m$, let $X_i$ be
independent random variables sampled from $\mathcal{D}_p$, and $\gamma_i > 0$ with $\gamma = \sum_i \gamma_i$. Let $X = \sum_i \gamma_i |X_i|^p$.
Then,

$$\log \Pr[X \le (1 - t) \beta_p \gamma] \le \frac{-\gamma t^2}{6 \max_i \gamma_i}.$$

Given these results, here is our main result for input-sparsity time low-distortion subspace
embeddings for $\ell_p$. The proof of this theorem is similar to the proof of Theorem 2, except that we
replace the $\ell_1$ norm $\|\cdot\|_1$ by $\|\cdot\|_p$ and use the tail inequalities from Lemmas 9 and 10 (rather than
Lemmas 5 and 6).



Theorem 4 (Low-distortion Embedding for $\ell_p$). Given $A \in \mathbb{R}^{n \times d}$ with full column rank and
$p \in (1,2)$, let $\Pi = SD \in \mathbb{R}^{s \times n}$ where $S \in \mathbb{R}^{s \times n}$ has each column chosen independently and
uniformly from the $s$ standard basis vectors of $\mathbb{R}^s$, and where $D \in \mathbb{R}^{n \times n}$ is a diagonal matrix with
diagonals chosen independently from $\mathcal{D}_p$. Set $s = \omega d^5 \log^5 d$ with $\omega$ sufficiently large. Then with a
constant probability, we have

$$1/O((d\log d)^{2/p}) \cdot \|Ax\|_p \le \|\Pi A x\|_p \le O((d\log d)^{1/p}) \cdot \|Ax\|_p, \quad \forall x \in \mathbb{R}^d.$$

In addition, $\Pi A$ can be computed in $O(\mathrm{nnz}(A))$ time.
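
To make the input-sparsity running time concrete, the following is a minimal sketch (not the authors' implementation) of applying $\Pi = SD$ to a sparse matrix given as (row, column, value) triples. It assumes that $\mathcal{D}_p$ is the symmetric p-stable law with characteristic function $e^{-|t|^p}$ and samples it with the Chambers-Mallows-Stuck formula; the function names `sample_symmetric_stable` and `sparse_stable_embed` are hypothetical.

```python
import math
import random


def sample_symmetric_stable(p, rng):
    """One draw with characteristic function exp(-|t|^p), 0 < p <= 2.

    Uses the Chambers-Mallows-Stuck formula (an assumption of this sketch;
    the scale convention of D_p is also assumed).
    """
    theta = rng.uniform(-math.pi / 2, math.pi / 2)
    w = rng.expovariate(1.0)
    while w == 0.0:  # guard against a zero exponential draw
        w = rng.expovariate(1.0)
    if p == 1.0:
        return math.tan(theta)  # Cauchy case
    return (math.sin(p * theta) / math.cos(theta) ** (1.0 / p)
            * (math.cos((1.0 - p) * theta) / w) ** ((1.0 - p) / p))


def sparse_stable_embed(entries, n, s, p, seed=0):
    """Compute Pi*A for Pi = S*D, touching each nonzero of A exactly once.

    entries: iterable of (row j, column k, value) triples of the n-row matrix A.
    Returns (dict mapping (i, k) -> value of Pi*A, bucket list, scale list).
    """
    rng = random.Random(seed)
    bucket = [rng.randrange(s) for _ in range(n)]  # column j of S is e_{bucket[j]}
    scale = [sample_symmetric_stable(p, rng) for _ in range(n)]  # diagonal of D
    out = {}
    for j, k, v in entries:
        key = (bucket[j], k)
        out[key] = out.get(key, 0.0) + scale[j] * v
    return out, bucket, scale
```

The main loop performs one hash lookup and one multiply-add per nonzero, which is where the $O(\mathrm{nnz}(A))$ cost comes from; drawing `bucket` and `scale` costs $O(n)$ on top of that.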

Similar to the $\ell_1$ case, our input-sparsity time $\ell_p$ subspace embedding of Theorem 4 improves the
$O(nd\log n)$-time embedding of Clarkson et al. [10]. In addition, by combining Theorem 4 and
Lemma 3, we can compute a $(1 \pm \epsilon)$-distortion embedding in $O(\mathrm{nnz}(A) \cdot \log n)$ time.

Theorem 5 ($(1 \pm \epsilon)$-distortion Embedding for $\ell_p$). Given $A \in \mathbb{R}^{n \times d}$ and $p \in [1,2)$, it takes
$O(\mathrm{nnz}(A) \cdot \log n)$ time to compute a sampling matrix $S \in \mathbb{R}^{s \times n}$ with $s = O(\mathrm{poly}(d)\log(1/\epsilon)/\epsilon^2)$
such that with a constant probability, $S$ embeds $\mathcal{A}_p$ into $(\mathbb{R}^s, \|\cdot\|_p)$ with distortion $1 \pm \epsilon$.

These improvements for $\ell_p$ subspace embedding also propagate to related $\ell_p$-based applications. In
particular, we can establish an improved algorithm for solving the $\ell_p$ regression problem in nearly
input-sparsity time.

Corollary 3 (Fast $\ell_p$ Regression). Given $p \in (1,2)$, with a constant probability, a $(1+\epsilon)$-approximate solution to an $\ell_p$ regression problem can be computed in

$$O(\mathrm{nnz}(A) \cdot \log n + \mathcal{T}_p(\epsilon;\ \mathrm{poly}(d)\log(1/\epsilon)/\epsilon^2,\ d))$$

time.

For completeness, we also present a result for low-distortion dense embeddings for $\ell_p$ that the
tail inequalities from Lemmas 9 and 10 enable us to construct. See Appendix A.7 for a proof of
the following theorem.

Theorem 6 (Low-distortion Dense Embedding for $\ell_p$). Given $A \in \mathbb{R}^{n \times d}$ with full column rank and
$p \in (1,2)$, let $\Pi \in \mathbb{R}^{s \times n}$ whose entries are i.i.d. samples from $\mathcal{D}_p$. If $s = \omega d \log d$ for $\omega$ sufficiently
large, with a constant probability, we have

$$1/O(1) \cdot \|Ax\|_p \le \|\Pi A x\|_p \le O((d\log d)^{1/p}) \cdot \|Ax\|_p, \quad \forall x \in \mathbb{R}^d.$$

In addition, $\Pi A$ can be computed in $O(\mathrm{nnz}(A) \cdot d\log d)$ time.

Remark. The result in Theorem 6 is based on a dense $\ell_p$ subspace embedding that is analogous
to the dense Gaussian embedding for $\ell_2$ and the dense Cauchy embedding of [28] for $\ell_1$. Although
the running time (if one is simply interested in FLOP counts in RAM) of Theorem 6 is somewhat
worse than that of Theorem 4, the embedding dimension and condition number quality (the ratio
of the upper bound on the distortion and the lower bound on the distortion) are much better. Our
numerical implementations, both with the $\ell_1$ norm [10] and with the $\ell_2$ norm [23], strongly suggest
that the latter quantities are more important to control when implementing randomized regression
algorithms in large-scale parallel and distributed settings.



6 Improving the Embedding Dimension 

In Theorem 2 and Theorem 4, the embedding dimension is $s = O(\mathrm{poly}(d)\log(1/\epsilon)/\epsilon^2)$, where the
$\mathrm{poly}(d)$ term is a somewhat large polynomial of $d$ that directly multiplies the $\log(1/\epsilon)/\epsilon^2$ term.
(See the remark below for comments on the precise value of the $\mathrm{poly}(d)$ term.) This is not ideal
for the subspace embedding and the $\ell_p$ regression, because we want to have a small embedding
dimension and a small subsampled problem, respectively. Here, we show that it is possible to
decouple the large polynomial of $d$ and the $\log(1/\epsilon)/\epsilon^2$ term via another round of sampling and
conditioning without increasing the complexity. See Algorithm 2 for details on this procedure.
Theorem 7 provides our main quality-of-approximation result for Algorithm 2; its proof can be
found in Appendix A.8.

Theorem 7 (Improving the Embedding Dimension). Given $p \in [1,2)$, with a constant probability,
Algorithm 2 computes a $(1 \pm \epsilon)$-distortion embedding of $\mathcal{A}_p$ into $(\mathbb{R}^{O(d^{3+p/2}\log(1/\epsilon)/\epsilon^2)}, \|\cdot\|_p)$ in
$O(\mathrm{nnz}(A) \cdot \log n)$ time.

Then, by applying Theorem 7 to the $\ell_p$ regression problem, we can improve the size of the subsampled
problem and hence the overall running time.

Algorithm 2 Improving the Embedding Dimension

Input: $A \in \mathbb{R}^{n \times d}$ with full column rank, $p \in [1,2)$, and $\epsilon \in (0,1)$.

Output: A $(1 \pm \epsilon)$-distortion embedding $S \in \mathbb{R}^{O(d^{3+p/2}\log(1/\epsilon)/\epsilon^2) \times n}$ of $\mathcal{A}_p$.

1: Compute a low-distortion embedding $\Pi \in \mathbb{R}^{O(\mathrm{poly}(d)) \times n}$ of $\mathcal{A}_p$ (Theorems 2 and 4).

2: Compute $\tilde{R} \in \mathbb{R}^{d \times d}$ from $\Pi A$ such that $A\tilde{R}^{-1}$ is well-conditioned (QR or Lemma 2).

3: Compute a $(1 \pm 1/2)$-distortion embedding $\tilde{S} \in \mathbb{R}^{O(\mathrm{poly}(d)) \times n}$ of $\mathcal{A}_p$ (Lemma 3).

4: Compute $R \in \mathbb{R}^{d \times d}$ such that $\kappa_p(\tilde{S}AR^{-1}) \le 2d$ (Theorem [?]).

5: Compute a $(1 \pm \epsilon)$-distortion embedding $S \in \mathbb{R}^{O(d^{3+p/2}\log(1/\epsilon)/\epsilon^2) \times n}$ of $\mathcal{A}_p$ (Lemma 3).



Corollary 4 (Improved Fast $\ell_p$ Regression). Given $p \in [1,2)$, with a constant probability, a $(1+\epsilon)$-approximate solution to an $\ell_p$ regression problem can be computed in

$$O(\mathrm{nnz}(A) \cdot \log n + \mathcal{T}_p(\epsilon;\ d^{3+p/2}\log(1/\epsilon)/\epsilon^2,\ d))$$

time. The second term comes from solving a subsampled problem of size $O(d^{3+p/2}\log(1/\epsilon)/\epsilon^2) \times d$.

Remark. We have stated our results in the previous sections as $\mathrm{poly}(d)$ without stating the value
of the polynomial because there are numerous trade-offs between the conditioning quality and the
running time. For example, let $p = 1$. We can use a rounding algorithm instead of QR to compute
the $R$ matrix. If we use the input-sparsity time embedding with the $O(d)$-rounding algorithm of [10],
then the running time to compute the $(1 \pm \epsilon)$-distortion embedding is $O(\mathrm{nnz}(A) \cdot \log n + d^8/\epsilon^2)$
and the embedding dimension is $O(d^{6.5}/\epsilon^2)$ (ignoring log factors). If, on the other hand, we
use QR to compute $R$, then the running time is $O(\mathrm{nnz}(A) \cdot \log n + d^7/\epsilon^2)$ and the embedding
dimension is $O(d^8/\epsilon^2)$. However, with the result from this section, the running time is simply
$O(\mathrm{nnz}(A) \cdot \log n + \mathrm{poly}(d) + \mathcal{T}_p(\epsilon;\ d^{3+p/2}/\epsilon^2,\ d))$ and the $\mathrm{poly}(d)$ term can be absorbed by the
$\mathrm{nnz}(A)$ term.

A Appendix

A.1 Proof of Theorem 1 ($(1 \pm \epsilon)$-distortion Embedding for $\ell_2$)

Let the $n \times d$ matrix $U$ be an orthonormal basis for the range of the $n \times d$ matrix $A$. Rather than
proving the theorem by establishing that

$$(1-\epsilon)\|Uz\|_2 \le \|\Pi U z\|_2 \le (1+\epsilon)\|Uz\|_2$$

holds for all $z \in \mathbb{R}^d$, as is essentially done in, e.g., [16] and [11], we note that $U^T U = I_d$, and we
directly bound the extent to which the embedding process perturbs this product. To do so, define

$$X = (\Pi U)^T (\Pi U) = U^T D^T S^T S D U.$$

That is,

$$x_{kl} = \sum_{i=1}^{s} \Big(\sum_{j=1}^{n} s_{ij} d_j u_{jk}\Big)\Big(\sum_{j=1}^{n} s_{ij} d_j u_{jl}\Big), \quad k, l \in \{1, \ldots, d\},$$

where $s_{ij}$ is the $(i,j)$-th element of $S$, $d_j$ is the $j$-th diagonal element of $D$, and $u_{jk}$ is the $(j,k)$-th
element of $U$. We will use the following facts in the proof:

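$$E[d_{j_1} d_{j_2}] = \delta_{j_1 j_2},$$

$$E[s_{i_1 j_1} s_{i_2 j_2}] = \begin{cases} 1/s^2 & \text{if } j_1 \ne j_2, \\ 1/s & \text{if } i_1 = i_2,\ j_1 = j_2, \\ 0 & \text{if } i_1 \ne i_2,\ j_1 = j_2. \end{cases}$$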

We have,

$$E[x_{kl}] = E\Big[\sum_{i, j_1, j_2} s_{i j_1} d_{j_1} u_{j_1 k} \cdot s_{i j_2} d_{j_2} u_{j_2 l}\Big] = \sum_{i,j} E[s_{ij}^2]\, u_{jk} u_{jl} = \sum_{j} u_{jk} u_{jl} = \delta_{kl},$$



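and we also have

$$E[x_{kl}^2] = E\Big[\Big(\sum_{i} \Big(\sum_{j} s_{ij} d_j u_{jk}\Big)\Big(\sum_{j} s_{ij} d_j u_{jl}\Big)\Big)^2\Big]$$

$$= \sum_{i_1, i_2} \sum_{j_1, j_2, j_3, j_4} E\big[s_{i_1 j_1} d_{j_1} u_{j_1 k} \cdot s_{i_1 j_2} d_{j_2} u_{j_2 l} \cdot s_{i_2 j_3} d_{j_3} u_{j_3 k} \cdot s_{i_2 j_4} d_{j_4} u_{j_4 l}\big]$$

$$= \sum_{i_1, i_2} \Big( \sum_{j} E\big[s_{i_1 j} u_{jk} \cdot s_{i_1 j} u_{jl} \cdot s_{i_2 j} u_{jk} \cdot s_{i_2 j} u_{jl}\big] + \sum_{j_1 \ne j_2} E\big[s_{i_1 j_1} u_{j_1 k} \cdot s_{i_1 j_1} u_{j_1 l} \cdot s_{i_2 j_2} u_{j_2 k} \cdot s_{i_2 j_2} u_{j_2 l}\big]$$

$$\qquad + \sum_{j_1 \ne j_2} E\big[s_{i_1 j_1} u_{j_1 k} \cdot s_{i_1 j_2} u_{j_2 l} \cdot s_{i_2 j_1} u_{j_1 k} \cdot s_{i_2 j_2} u_{j_2 l}\big] + \sum_{j_1 \ne j_2} E\big[s_{i_1 j_1} u_{j_1 k} \cdot s_{i_1 j_2} u_{j_2 l} \cdot s_{i_2 j_2} u_{j_2 k} \cdot s_{i_2 j_1} u_{j_1 l}\big] \Big)$$

$$= \sum_{j} u_{jk}^2 u_{jl}^2 + \sum_{j_1 \ne j_2} u_{j_1 k} u_{j_1 l} u_{j_2 k} u_{j_2 l} + \frac{1}{s} \sum_{j_1 \ne j_2} u_{j_1 k}^2 u_{j_2 l}^2 + \frac{1}{s} \sum_{j_1 \ne j_2} u_{j_1 k} u_{j_2 l} u_{j_2 k} u_{j_1 l}$$

$$= \Big(\sum_{j} u_{jk} u_{jl}\Big)^2 + \frac{1}{s}\Big( \Big(\sum_{j} u_{jk}^2\Big)\Big(\sum_{j} u_{jl}^2\Big) + \Big(\sum_{j} u_{jk} u_{jl}\Big)^2 - 2\sum_{j} u_{jk}^2 u_{jl}^2 \Big)$$

$$= \begin{cases} 1 + \frac{2}{s}\big(1 - \|U_{*k}\|_4^4\big) & \text{if } k = l, \\ \frac{1}{s}\big(1 - 2\langle U_{*k}^2, U_{*l}^2 \rangle\big) & \text{if } k \ne l. \end{cases}$$

By Chebyshev's inequality, it follows that

$$\Pr[|x_{kl} - \delta_{kl}| \ge \epsilon/d] \le \frac{\mathrm{Var}[x_{kl}]}{(\epsilon/d)^2} = \frac{d^2\, \mathrm{Var}[x_{kl}]}{\epsilon^2} \le \begin{cases} \frac{2d^2}{s\epsilon^2} & \text{if } k = l, \\ \frac{d^2}{s\epsilon^2} & \text{if } k \ne l. \end{cases}$$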



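Denote by $\mathcal{E}$ the event $|X - I|_\infty < \epsilon/d$, where recall that $|\cdot|_\infty$ denotes the element-wise $\ell_\infty$ norm of
a matrix. Note that $X$ is symmetric. We have,

$$\Pr[\mathcal{E}] \ge 1 - \sum_{k \le l} \Pr[|x_{kl} - \delta_{kl}| \ge \epsilon/d] \ge 1 - \sum_{k} \frac{2d^2}{s\epsilon^2} - \sum_{k < l} \frac{d^2}{s\epsilon^2} = 1 - \frac{2d^3 + d^3(d-1)/2}{s\epsilon^2} = 1 - \frac{d^4 + 3d^3}{2 s \epsilon^2}.$$

If we set $s = (d^4 + 3d^3)/\epsilon^2$, then we have $\Pr[\mathcal{E}] \ge 0.5$. By Gershgorin's circle theorem, $\mathcal{E}$ implies
$1 - \epsilon \le \lambda(X) \le 1 + \epsilon$, i.e.,

$$(1-\epsilon)\|z\|_2^2 \le z^T X z \le (1+\epsilon)\|z\|_2^2, \quad \forall z \in \mathbb{R}^d,$$

which is equivalent to

$$\sqrt{1-\epsilon}\,\|z\|_2 \le \|\Pi U z\|_2 \le \sqrt{1+\epsilon}\,\|z\|_2, \quad \forall z \in \mathbb{R}^d,$$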
which, since $1 - \epsilon \le \sqrt{1-\epsilon}$ and $1 + \epsilon \ge \sqrt{1+\epsilon}$ for any $\epsilon \in (0,1)$, implies that

$$(1-\epsilon)\|z\|_2 \le \|\Pi U z\|_2 \le (1+\epsilon)\|z\|_2, \quad \forall z \in \mathbb{R}^d.$$

Therefore, $\mathcal{E}$ implies

$$(1-\epsilon)\|Ax\|_2 \le \|\Pi A x\|_2 \le (1+\epsilon)\|Ax\|_2, \quad \forall x \in \mathbb{R}^d,$$

which concludes the proof.
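
The first-moment identity used in this proof can be checked exactly on a tiny instance. The sketch below is illustrative only: it assumes that the diagonal of $D$ holds independent uniform $\pm 1$ signs (consistent with $E[d_{j_1}d_{j_2}] = \delta_{j_1 j_2}$), enumerates every hash assignment and sign pattern, and averages $(\Pi U)^T(\Pi U)$. For a general $U$ the average is $U^T U$, which is $I_d$ when $U$ is orthonormal. The name `expected_gram` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def expected_gram(U, s):
    """Exact E[(Pi U)^T (Pi U)] for Pi = S D, by brute-force enumeration.

    U is an n x d list of lists with integer (or Fraction) entries; S hashes each
    of the n rows to one of s buckets and D carries +/-1 signs (assumed here).
    """
    n, d = len(U), len(U[0])
    total = [[Fraction(0)] * d for _ in range(d)]
    count = 0
    for buckets in product(range(s), repeat=n):
        for signs in product((-1, 1), repeat=n):
            PU = [[Fraction(0)] * d for _ in range(s)]
            for j in range(n):
                for k in range(d):
                    PU[buckets[j]][k] += signs[j] * U[j][k]
            for k in range(d):
                for l in range(d):
                    total[k][l] += sum(PU[i][k] * PU[i][l] for i in range(s))
            count += 1
    return [[total[k][l] / count for l in range(d)] for k in range(d)]
```

The enumeration has $s^n 2^n$ terms, so it is only practical for very small $n$; it is meant as a sanity check of the algebra, not as a way to compute anything at scale.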

A.2 Proof of Theorem 2 (Low-distortion Embedding for $\ell_1$)

We start with the following result, which establishes the existence of the so-called Auerbach's basis
of a $d$-dimensional normed vector space. For our proof, we will only need its existence and not an
algorithm to construct it.

Lemma 11. (Auerbach [2]) Let $(\mathcal{A}, \|\cdot\|)$ be a $d$-dimensional normed vector space. There exists a
basis $\{e_1, \ldots, e_d\}$ of $\mathcal{A}$, called Auerbach basis, such that $\|e_k\| = 1$ and $\|e^k\|_* = 1$ for $k = 1, \ldots, d$,
where $\{e^1, \ldots, e^d\}$ is a basis of $\mathcal{A}^*$ dual to $\{e_1, \ldots, e_d\}$.

This Auerbach's lemma implies that a $(d, 1, 1)$-conditioned basis matrix of $\mathcal{A}_1$ exists, which will
be denoted by $U$ throughout the proof. By definition, $U$'s columns are unit vectors in the $\ell_1$
norm (thus $|U|_1 = d$, where recall that $|\cdot|_1$ denotes the element-wise $\ell_1$ norm of a matrix) and
$\|x\|_\infty \le \|Ux\|_1$, $\forall x \in \mathbb{R}^d$. Denote by $u_j$ the $j$-th row of $U$, $j = 1, \ldots, n$. Define $v_j = \|u_j\|_1$ the $\ell_1$
leverage scores of $A$. We have $\sum_j v_j = |U|_1 = d$. Let $\tau > 0$ to be determined later, and define two
index sets $H = \{j \mid v_j \ge \tau\}$ and $L = \{j \mid v_j < \tau\}$. It is easy to see that $|H| \le d/\tau$, where $|\cdot|$ is used
to denote the size of a finite set, and $\|v^L\|_\infty \le \tau$, where

$$v^L_j = \begin{cases} v_j, & \text{if } j \in L, \\ 0, & \text{otherwise,} \end{cases} \quad j = 1, \ldots, n.$$

Similarly, when an index set appears as a superscript, we mean zeroing out elements or rows that
do not belong to this index set, e.g., $v^L$ and $U^L$. Define

$$Y = \{y \in \mathbb{R}^n \mid y = Ux,\ \|x\|_\infty = 1,\ x \in \mathbb{R}^d\}.$$

For any $y = Ux \in Y$, we have $\|y\|_1 = \|Ux\|_1 \ge \|x\|_\infty = 1$,

$$|y_j| = |u_j^T x| \le \|u_j\|_1 \|x\|_\infty = v_j, \quad j = 1, \ldots, n,$$

and thus $\|y\|_1 \le \|v\|_1 = d$. Define $Y^L = \{y \in Y \mid \|y^L\|_1 \ge \frac{1}{2}\|y\|_1\}$ and $Y^H = Y \setminus Y^L$. Given $S$,
define a mapping $\phi : \{1, \ldots, n\} \to \{1, \ldots, s\}$ such that $s_{\phi(j), j} = 1$, $j = 1, \ldots, n$, and split $L$ into
two subsets: $\hat{L} = \{j \in L \mid \phi(j) \in \phi(H)\}$ and $\bar{L} = L \setminus \hat{L}$. Consider these events:

• $\mathcal{E}_U$: $|\Pi U|_1 \le \omega_1 d \log d$ for some $\omega_1 > 0$.

• $\mathcal{E}_L$: $\|S v^L\|_\infty \le \omega_2/(d \log d)$ for some $\omega_2 > 0$.

• $\mathcal{E}_H$: $\phi(j_1) \ne \phi(j_2)$, $\forall j_1 \ne j_2$, $j_1, j_2 \in H$.

• $\mathcal{E}_C$: $\min_{j \in H} |c_j| \ge \omega_3/(d^2 \log^2 d)$ for some $\omega_3 > 0$.

• $\mathcal{E}_{\hat{L}}$: $|\Pi U^{\hat{L}}|_1 \le \omega_4/(d^2 \log^2 d)$ for some $\omega_4 > 0$.

Recall that we set $s = \omega d^5 \log^5 d$ in Theorem 2. We will show that, with $\omega$ sufficiently large
and proper choices of $\omega_1$, $\omega_2$, $\omega_3$, and $\omega_4$, the event $\mathcal{E}_U$ leads to an upper bound of $\|\Pi y\|_1$ for all
$y \in \mathrm{range}(A)$, $\mathcal{E}_U$ and $\mathcal{E}_L$ lead to a lower bound of $\|\Pi y\|_1$ for all $y \in Y^L$ with probability at least
0.9, and $\mathcal{E}_H$, $\mathcal{E}_{\hat{L}}$, and $\mathcal{E}_C$ together imply a lower bound of $\|\Pi y\|_1$ for all $y \in Y^H$.

Lemma 12. Provided $\mathcal{E}_U$, we have

$$\|\Pi y\|_1 \le \omega_1 d \log d \cdot \|y\|_1, \quad \forall y \in \mathrm{range}(A).$$

Proof. For any $y \in \mathrm{range}(A)$, we can find an $x$ such that $y = Ux$. Then,

$$\|\Pi y\|_1 = \|\Pi U x\|_1 \le |\Pi U|_1 \|x\|_\infty \le |\Pi U|_1 \|Ux\|_1 \le \omega_1 d \log d \cdot \|y\|_1. \qquad \square$$
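
The first inequality in this chain, $\|Mx\|_1 \le |M|_1 \|x\|_\infty$ for any matrix $M$, follows from the triangle inequality applied entry by entry. The following small sketch checks it numerically on random inputs; the helper names `entrywise_l1` and `l1_of_product` are hypothetical and the check is only a sanity test, not part of the proof.

```python
import random


def entrywise_l1(M):
    """Element-wise l1 norm |M|_1: the sum of absolute values of all entries."""
    return sum(abs(v) for row in M for v in row)


def l1_of_product(M, x):
    """Vector l1 norm ||M x||_1."""
    return sum(abs(sum(m * xi for m, xi in zip(row, x))) for row in M)


def check_bound(rows, cols, seed):
    """Return True if ||M x||_1 <= |M|_1 * ||x||_inf on one random instance."""
    rng = random.Random(seed)
    M = [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]
    x = [rng.uniform(-1, 1) for _ in range(cols)]
    x_inf = max(abs(v) for v in x)
    return l1_of_product(M, x) <= entrywise_l1(M) * x_inf + 1e-12
```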



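Lemma 13. Provided $\mathcal{E}_L$, for any fixed $y \in Y^L$, we have

$$\log \Pr\Big[\|\Pi y\|_1 \le \frac{1}{4}\|y\|_1\Big] \le -\frac{d \log d}{24\,\omega_2}.$$

Proof. Let $z = \Pi y$. We have,

$$\|\Pi y\|_1 = \sum_{i=1}^{s} |z_i| = \sum_{i=1}^{s} \Big|\sum_{j=1}^{n} s_{ij} c_j y_j\Big| \simeq \sum_{i=1}^{s} \Big(\sum_{j=1}^{n} s_{ij} |y_j|\Big) |\tilde{c}_i| \ge \sum_{i=1}^{s} \gamma_i |\tilde{c}_i| =: X,$$

where $\gamma_i = \sum_j s_{ij} |y^L_j|$ and $\{\tilde{c}_i\}$ are independent Cauchy variables. Let $\gamma = \sum_i \gamma_i = \|y^L\|_1$. Since $|y| \le v$, we have
$\gamma_i \le \|S v^L\|_\infty$. By Lemma 6,

$$\log \Pr\Big[X \le \frac{1}{2}\|y^L\|_1\Big] \le \frac{-\|y^L\|_1}{12\,\|S v^L\|_\infty}.$$

By assumption $\mathcal{E}_L$ and $\|y^L\|_1 \ge \frac{1}{2}\|y\|_1 \ge \frac{1}{2}$, we obtain the result. $\square$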
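Lemma 14. Assume both $\mathcal{E}_U$ and $\mathcal{E}_L$. If $\omega_1$ and $\omega_2$ satisfy

$$d \log\big(6d(1 + 4\omega_1 d \log d)\big) - \frac{d \log d}{24\,\omega_2} \le \log \delta$$

for some $\delta \in (0,1)$ regardless of $d$, then, with probability at least $1 - \delta$, we have

$$\|\Pi y\|_1 \ge \frac{1}{8}\|y\|_1, \quad \forall y \in Y^L.$$

Proof. Set $\epsilon = 1/(2 + 8\omega_1 d \log d)$ and create an $\epsilon$-net $Y^L_\epsilon \subseteq Y^L$ such that for any $y \in Y^L$, we can
find a $y_\epsilon \in Y^L_\epsilon$ such that $\|y - y_\epsilon\|_1 \le \epsilon$. Since $\|y\|_1 \le d$ for all $y \in Y^L$, there exists such an $\epsilon$-net
with at most $(3d/\epsilon)^d$ elements (Bourgain et al. [1]). By Lemma 13, we can apply a union bound
for all the elements in $Y^L_\epsilon$:

$$\Pr\Big[\|\Pi y_\epsilon\|_1 \ge \frac{1}{4}\|y_\epsilon\|_1,\ \forall y_\epsilon \in Y^L_\epsilon\Big] \ge 1 - \Big(\frac{3d}{\epsilon}\Big)^d e^{-\frac{d \log d}{24\omega_2}} = 1 - e^{d \log\frac{3d}{\epsilon} - \frac{d \log d}{24\omega_2}} \ge 1 - \delta.$$

For any $y \in Y^L$, we have, noting that $y - y_\epsilon \in \mathrm{range}(A)$,

$$\|\Pi y\|_1 \ge \|\Pi y_\epsilon\|_1 - \|\Pi(y - y_\epsilon)\|_1 \ge \frac{1}{4}\|y_\epsilon\|_1 - \omega_1 d \log d \cdot \|y - y_\epsilon\|_1 \ge \frac{1}{4}\|y\|_1 - \Big(\frac{1}{4} + \omega_1 d \log d\Big)\epsilon \ge \frac{1}{8}\|y\|_1.$$

So we establish a lower bound for all $y \in Y^L$. $\square$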
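Lemma 15. Provided $\mathcal{E}_H$, $\mathcal{E}_C$, and $\mathcal{E}_{\hat{L}}$, if $\omega_3 \ge 4\omega_4$, we have

$$\|\Pi y\|_1 \ge \frac{\omega_4}{d^2 \log^2 d}\,\|y\|_1, \quad \forall y \in Y^H.$$

Proof. For any $y = Ux \in Y^H$, we have,

$$\|\Pi y\|_1 \ge \|\Pi(y^H + y^{\hat{L}})\|_1 \ge \|\Pi y^H\|_1 - \|\Pi U^{\hat{L}} x\|_1$$

$$\ge \sum_{j \in H} |c_j| |y_j| - |\Pi U^{\hat{L}}|_1 \|x\|_\infty \ge \min_{j \in H} |c_j| \cdot \|y^H\|_1 - |\Pi U^{\hat{L}}|_1$$

$$\ge \frac{\omega_3}{2 d^2 \log^2 d}\,\|y\|_1 - \frac{\omega_4}{d^2 \log^2 d}\,\|y\|_1 \ge \frac{\omega_4}{d^2 \log^2 d}\,\|y\|_1,$$

which creates a lower bound for all $y \in Y^H$. $\square$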

We continue to show that, with $\omega$ sufficiently large, by setting $\tau = \omega^{-1/4}/(d \log^2 d)$ and choosing
$\omega_1$, $\omega_2$, $\omega_3$, and $\omega_4$ properly, we have each event with probability at least $1 - 0.08 = 0.92$ and thus

$$\Pr[\mathcal{E}_U \cap \mathcal{E}_L \cap \mathcal{E}_H \cap \mathcal{E}_{\hat{L}} \cap \mathcal{E}_C] \ge 0.6.$$

Moreover, the condition in Lemma 14 holds with $\delta = 0.1$, and the condition in Lemma 15 holds.
Therefore, $\Pi = SC$ has the desired property with probability at least 0.5, which would conclude
the proof of Theorem 2.

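Lemma 16. With probability at least 0.92, $\mathcal{E}_U$ holds with $\omega_1 = 500(1 + \log \omega)$.

Proof. With $S$ fixed, we have,

$$|\Pi U|_1 = |SCU|_1 = \sum_{k=1}^{d} \sum_{i=1}^{s} \Big|\sum_{j=1}^{n} s_{ij} c_j u_{jk}\Big| \simeq \sum_{k=1}^{d} \sum_{i=1}^{s} \Big(\sum_{j=1}^{n} |s_{ij} u_{jk}|\Big) |\tilde{c}_{ik}|,$$

where $\{\tilde{c}_{ik}\}$ are dependent Cauchy random variables. We have

$$\sum_{k=1}^{d} \sum_{i=1}^{s} \sum_{j=1}^{n} |s_{ij} u_{jk}| = \sum_{k=1}^{d} \sum_{j=1}^{n} |u_{jk}| = |U|_1 = d.$$

Apply Lemma 5,

$$\Pr[|\Pi U|_1 \ge t d] \le \frac{2 \log(s d t)}{t}.$$

Setting $\omega_1 = 500(1 + \log \omega)$ and $t = \omega_1 \log d$, we have

$$\frac{2 \log(s d t)}{t} = \frac{2 \log(\omega\, \omega_1 d^6 \log^6 d)}{\omega_1 \log d} \le 0.08.$$

We assume that $\log d \ge 1$ and $\log \omega \ge 1$. $\square$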

Lemma 17. For any $\delta \in (0, 0.1)$, if $s \ge d/\tau$, we have,

$$\Pr\Big[\|S v^L\|_\infty \ge \Big(1 + 2 \log \frac{d}{\delta \tau}\Big)\tau\Big] \le \delta.$$

Proof. Let $X_{ij} = s_{ij} v^L_j$. We have $E[X_{ij}] = v^L_j/s$, $E[X_{ij}^2] = (v^L_j)^2/s$, and $0 \le X_{ij} \le v^L_j \le \tau$. Fixed
$i$, $X_{ij}$ are independent, $j = 1, \ldots, n$. By Bernstein's inequality,

$$\log \Pr\Big[\sum_{j} X_{ij} \ge \|v^L\|_1/s + t\Big] \le \frac{-t^2/2}{\|v^L\|_2^2/s + \tau t/3} \le \frac{-t^2/2}{\tau(\|v^L\|_1/s + t/3)} \le \frac{-t^2/(2\tau)}{d/s + t/3},$$

where we use Hölder's inequality: $\|v^L\|_2^2 \le \|v^L\|_1 \|v^L\|_\infty \le d\tau$. To obtain a union bound for all $i$
with probability $1 - \delta$, we need

$$\frac{-t^2/(2\tau)}{d/s + t/3} + \log s \le \log \delta.$$

Given $\delta < 0.1$, it suffices to choose $s = d/\tau$ and $t = 2 \log(d/(\delta\tau))\,\tau$. Note that $\|v^L\|_1/s \le \|v\|_1/s = \tau$. We have

$$\Pr\Big[\|S v^L\|_\infty \ge \Big(1 + 2 \log \frac{d}{\delta \tau}\Big) \cdot \tau\Big] \le \delta.$$

Increasing $s$ will decrease the failure rate, so it holds for all $s \ge d/\tau$. $\square$

Lemma 18. With probability at least 0.92, $\mathcal{E}_L$ holds with $\omega_2 = (15 + \log \omega)/\omega^{1/4}$.

Proof. By Lemma 17, with probability at least 0.92, $\mathcal{E}_L$ holds with

$$\omega_2 = \frac{1 + 2 \log\big(\omega^{1/4} d^2 \log^2 d / 0.08\big)}{\omega^{1/4} \log d} \le \frac{15 + \log \omega}{\omega^{1/4}}. \qquad \square$$



Lemma 19. With the above choices of $\omega_1$ and $\omega_2$, the condition in Lemma 14 holds with $\delta = 0.1$
for sufficiently large $\omega$.

Proof. With $\omega_1 = 500(1 + \log \omega)$ and $\omega_2 = (15 + \log \omega)/\omega^{1/4}$, the first term in

$$d \log\big(6d(1 + 4\omega_1 d \log d)\big) - \frac{d \log d}{24\,\omega_2}$$

increases much slower than the second term as $\omega$ increases, while both are at the order of $d \log d$.
Therefore, if $\omega$ is sufficiently large, the condition holds with $\delta = 0.1$. $\square$

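Lemma 20. If $\omega \ge 160$, event $\mathcal{E}_H$ holds with probability at least 0.92.

Proof. Given $j_1, j_2 \in H$ and $j_1 \ne j_2$, let $X_{j_1 j_2} = 1$ if $\phi(j_1) = \phi(j_2)$ and $X_{j_1 j_2} = 0$ otherwise. It is
easy to see that $\Pr[X_{j_1 j_2} = 1] = 1/s$. Therefore,

$$\Pr[\mathcal{E}_H] \ge 1 - \sum_{j_1 < j_2} \Pr[X_{j_1 j_2} = 1] \ge 1 - \frac{|H|^2}{s} \ge 1 - \frac{d^2}{s \tau^2} \ge 1 - \frac{1}{\omega^{1/2}}.$$

It suffices if $\omega \ge 160$.

$\square$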
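
The union bound over pairs used here can be compared with the exact no-collision probability. The sketch below is a small illustration under the assumption that each index in $H$ is hashed independently and uniformly into $s$ buckets; the function names are hypothetical.

```python
from fractions import Fraction


def no_collision_probability(h, s):
    """Exact probability that h items hashed uniformly into s buckets are all distinct."""
    prob = Fraction(1)
    for i in range(h):
        prob *= Fraction(s - i, s)
    return prob


def pairwise_union_bound(h, s):
    """Lower bound 1 - C(h, 2)/s from summing Pr[phi(j1) = phi(j2)] = 1/s over pairs."""
    return 1 - Fraction(h * (h - 1), 2 * s)
```

Since $\binom{|H|}{2}/s \le |H|^2/s$, the bound in the proof is a slightly looser form of `pairwise_union_bound`.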

Lemma 21. With probability at least 0.92, event $\mathcal{E}_C$ holds with $\omega_3 = 1/(8\omega^{1/4})$.

Proof. Let $c$ be a Cauchy variable. We have

$$\Pr[|c| \le t] = \frac{2}{\pi} \tan^{-1} t \le \frac{2t}{\pi}.$$

The size of $H$ is at most $d/\tau = \omega^{1/4} d^2 \log^2 d$. Then

$$\Pr[\mathcal{E}_C] \ge 1 - |H| \cdot \Pr\Big[|c| \le \frac{\omega_3}{d^2 \log^2 d}\Big] \ge 1 - \omega^{1/4} d^2 \log^2 d \cdot \frac{2\omega_3}{\pi d^2 \log^2 d}.$$

Therefore, $\omega_3 = 1/(8\omega^{1/4})$ would suffice.

$\square$
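
As a quick numerical companion to this proof, the following sketch evaluates the Cauchy small-ball probability and the resulting failure bound for $\mathcal{E}_C$. It assumes the choices stated above ($|H| \le \omega^{1/4} d^2 \log^2 d$ and $\omega_3 = 1/(8\omega^{1/4})$); the function names are hypothetical.

```python
import math


def cauchy_small_ball(t):
    """Pr[|c| <= t] for a standard Cauchy variable c."""
    return 2.0 / math.pi * math.atan(t)


def failure_bound(omega, d):
    """Union bound |H| * Pr[|c| <= omega_3 / (d^2 log^2 d)] on the failure of E_C."""
    size_h = omega ** 0.25 * d ** 2 * math.log(d) ** 2
    omega3 = 1.0 / (8.0 * omega ** 0.25)
    return size_h * cauchy_small_ball(omega3 / (d ** 2 * math.log(d) ** 2))
```

Because $\tan^{-1} t \le t$, the bound is at most $1/(4\pi) \approx 0.0796$ for every $\omega$ and $d$, which matches the 0.92 success probability claimed in the lemma.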



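Lemma 22. With probability at least 0.92, event $\mathcal{E}_{\hat{L}}$ holds with $\omega_4 = 25000(1 + \log \omega)/\omega^{3/4}$. Thus
with $\omega$ sufficiently large and the above choice of $\omega_3$, the condition in Lemma 15, $\omega_3 \ge 4\omega_4$, holds.

Proof. We have,

$$E[|U^{\hat{L}}|_1] = \frac{|H|}{s}\,|U^L|_1 \le \frac{\omega^{1/4} d^2 \log^2 d}{\omega d^5 \log^5 d} \cdot d = \frac{1}{\omega^{3/4} d^2 \log^3 d}.$$

By Markov's inequality,

$$\Pr\Big[|U^{\hat{L}}|_1 \ge \frac{25}{\omega^{3/4} d^2 \log^3 d}\Big] \le 0.04.$$

Assume that $|U^{\hat{L}}|_1 \le \frac{25}{\omega^{3/4} d^2 \log^3 d}$. Similar to the proof of Lemma 16, we have

$$|\Pi U^{\hat{L}}|_1 = \sum_{k=1}^{d} \sum_{i \in \phi(H)} \Big|\sum_{j \in \hat{L}} s_{ij} c_j u_{jk}\Big| \simeq \sum_{k=1}^{d} \sum_{i \in \phi(H)} \Big(\sum_{j \in \hat{L}} s_{ij} |u_{jk}|\Big) |\tilde{c}_{ik}|,$$

where $\{\tilde{c}_{ik}\}$ are dependent Cauchy variables. Apply Lemma 5,

$$\Pr\big[|\Pi U^{\hat{L}}|_1 \ge t\,|U^{\hat{L}}|_1\big] \le \frac{2 \log(|H| d t)}{t}.$$

It suffices to choose $t = 1000(1 + \log \omega) \log d$ to make the RHS less than 0.04. So with probability
at least 0.92, $\mathcal{E}_{\hat{L}}$ holds with $\omega_4 = 25000(1 + \log \omega)/\omega^{3/4}$. $\square$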

A.3 Proof of Corollary 2 (Fast $\ell_1$ Regression)

By Theorem 2 and Lemma 3, we know that Steps 2 and 4 of Algorithm 1 succeed with a constant
probability. Conditioning on this event, we have

$$\|A\hat{x} - b\|_1 \le \frac{1}{1 - \epsilon/4}\,\|SA\hat{x} - Sb\|_1 \le \frac{1 + \epsilon/4}{1 - \epsilon/4}\,\|SAx^* - Sb\|_1 \le \frac{(1 + \epsilon/4)^2}{1 - \epsilon/4}\,\|Ax^* - b\|_1 \le (1 + \epsilon)\|Ax^* - b\|_1,$$

where the last inequality is due to $\epsilon < 1/2$. By Theorem 2, Step 2 takes $O(\mathrm{nnz}(A))$ time, and
Step 3 takes $O(\mathrm{poly}(d))$ time because $\Pi A$ has $O(\mathrm{poly}(d))$ rows. Then, by Lemma 3, Step 4 takes
$O(\mathrm{nnz}(A) \cdot \log n)$ time, and Step 5 takes $\mathcal{T}_1(\epsilon/4;\ O(\mathrm{poly}(d)\log(1/\epsilon)/\epsilon^2),\ d)$ time. Therefore, the
total running time of Algorithm 1 is as stated.
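
The last step of the error chain relies on $(1+\epsilon/4)^2/(1-\epsilon/4) \le 1+\epsilon$. Cross-multiplying reduces this to $5\epsilon^2/16 \le \epsilon/4$, i.e., $\epsilon \le 4/5$, so the stated range $\epsilon < 1/2$ is sufficient. The snippet below is a small numerical check of that arithmetic; `chained_factor` is a hypothetical helper name.

```python
def chained_factor(eps):
    """Accumulated approximation factor (1 + eps/4)^2 / (1 - eps/4) from the proof."""
    return (1 + eps / 4) ** 2 / (1 - eps / 4)


def holds_on_grid(upper, steps=1000):
    """Check chained_factor(eps) <= 1 + eps for eps on a uniform grid in (0, upper]."""
    return all(chained_factor(upper * k / steps) <= 1 + upper * k / steps
               for k in range(1, steps + 1))
```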

A.4 Proof of Lemma 8

First, we know that

$$\Pr[|X_p|^p \ge t] = \Pr[|X_p| \ge t^{1/p}] = 2 \cdot \Pr[X_p \ge t^{1/p}].$$

Next, we state the following lemma, which is due to Nolan [25].

Lemma 23. (Nolan [25, Thm. 1.12]) Let $X \sim \mathcal{D}_p$ with $p \in [1,2)$. Then as $x \to \infty$,

$$\Pr[X \ge x] \sim c_p x^{-p},$$

where $c_p = \sin\frac{\pi p}{2} \cdot \Gamma(p)/\pi$.

By Lemma 23, it follows that, as $t \to \infty$,

$$\Pr[|X_p|^p \ge t] \sim 2 c_p t^{-1}.$$

For the Cauchy distribution, we have

$$\Pr[|C| \ge t] = 1 - \frac{2}{\pi} \tan^{-1} t = \frac{2}{\pi} \tan^{-1} \frac{1}{t} \sim \frac{2}{\pi} \cdot t^{-1}.$$

Hence, there exist $\alpha'_p > 0$ and $t_1 > 0$ such that for all $t > t_1$,

$$\Pr[\alpha'_p |C| \ge t] \ge \Pr[|X_p|^p \ge t].$$

Note that all the p-stable distributions with $p \in [1,2]$ have finite and positive density at $x = 0$.
Therefore, there exists $\alpha''_p > 0$ such that for all $0 \le t \le t_1$,

$$\Pr[\alpha''_p |C| \ge t] \ge \Pr[|X_p|^p \ge t].$$

Let $\alpha_p = \max\{\alpha'_p, \alpha''_p\}$. We get $\alpha_p |C| \succeq |X_p|^p$. For the Gaussian distribution, we have, as $t \to \infty$,

$$\Pr[|G|^2 \ge t] \sim \frac{2 e^{-t/2}}{\sqrt{2\pi t}},$$

which converges to zero much faster than $t^{-1}$, so we can apply similar arguments to obtain $\beta_p$.

A.5 Proof of Lemma 9 (Upper Tail Inequality for p-stable Distributions)

Let $C_i = F_c^{-1}(F_p(X_i))$, $i = 1, \ldots, m$, where $F_c$ is the CDF of the standard Cauchy distribution
and $F_p$ is the CDF of $\mathcal{D}_p$. $C_i$ follows the standard Cauchy distribution, and, by Lemma 8, we have
$\alpha_p |C_i| \ge |X_i|^p$. Therefore, for any $t \ge 1$,

$$\Pr[X \ge t\,\alpha_p \gamma] \le \Pr\Big[\sum_{i} \gamma_i |C_i| \ge t \gamma\Big] \le \frac{2 \log(mt)}{t}.$$

The last inequality is from Lemma 5.

A.6 Proof of Lemma 10 (Lower Tail Inequality for p-stable Distributions)

Let $G_i$ be independent random variables sampled from the standard Gaussian distribution, $i = 1, \ldots, m$. By Lemma 8, we have

$$\log \Pr[X \le \beta_p (1-t)\gamma] \le \log \Pr\Big[\sum_{i} \gamma_i |G_i|^2 \le (1-t)\gamma\Big].$$

The lower tail inequality from Lemma 7 concludes the proof.

A.7 Proof of Theorem 6 (Low-distortion Dense Embedding for $\ell_p$)

The proof is similar to the proof of Sohler and Woodruff [28, Theorem 5], except that the Cauchy
tail inequalities are replaced by tail inequalities for the stable distributions. For simplicity, we omit
the complete proof but show where to apply those tail inequalities. By Lemma 11, there exists a
$(d^{1/p}, 1, p)$-conditioned basis matrix of $\mathcal{A}_p$, denoted by $U$. Thus, $|U|_p^p = d$, where recall that $|\cdot|_p$
denotes the element-wise $\ell_p$ norm of a matrix. We have,

$$|\Pi U|_p^p = \sum_{k=1}^{d} \|\Pi U_{*k}\|_p^p = \sum_{k=1}^{d} \sum_{i=1}^{s} \Big|\sum_{j=1}^{n} \pi_{ij} u_{jk}\Big|^p \simeq \sum_{k=1}^{d} \sum_{i=1}^{s} \|U_{*k}\|_p^p\, |\tilde{X}_{ik}|^p,$$

where $\tilde{X}_{ik} \sim \mathcal{D}_p$. Applying Lemma 9, we get $|\Pi U|_p^p / s = O(d \log d)$ with a constant probability.
Define $Y = \{Ux \mid \|x\|_q = 1,\ x \in \mathbb{R}^d\}$. For any fixed $y \in Y$, we have

$$\|\Pi y\|_p^p = \sum_{i=1}^{s} \Big|\sum_{j=1}^{n} \pi_{ij} y_j\Big|^p \simeq \sum_{i=1}^{s} \|y\|_p^p\, |\tilde{X}_i|^p,$$

where the $\tilde{X}_i$ are i.i.d. samples from $\mathcal{D}_p$. Applying Lemma 10, we get $\|\Pi y\|_p^p / s \le 1/O(1)$ with an exponentially small
probability with respect to $s$. By choosing $s = \omega d \log d$ with $\omega$ sufficiently large and an $\epsilon$-net
argument on $Y$, we can obtain a union lower bound of $\|\Pi y\|_p^p$ on all the elements of $Y$ with a
constant probability. Then,

$$1/O(1) \cdot \|y\|_p^p \le \|\Pi y\|_p^p / s \le |\Pi U|_p^p \|x\|_q^p / s \le O(d \log d) \cdot \|Ux\|_p^p = O(d \log d)\,\|y\|_p^p, \quad y \in Y,$$

which gives us the desired result.



A.8 Proof of Theorem 7 (Improving the Embedding Dimension)

Each of Steps 1, 3, and 5 of Algorithm 2 succeeds with a constant probability and we can control the
failure rate by repeating trials. Thus, with a constant probability, all steps succeed. Conditioning
on this event, we have $\kappa_p(AR^{-1}) \le 6d$ because

$$\|AR^{-1}x\|_p \le 2\|\tilde{S}AR^{-1}x\|_p \le 4d\|x\|_2,$$

$$\|AR^{-1}x\|_p \ge \frac{2}{3}\|\tilde{S}AR^{-1}x\|_p \ge \frac{2}{3}\|x\|_2, \quad \forall x \in \mathbb{R}^d.$$

By Lemma 1, $\bar{\kappa}_p(AR^{-1}) \le 6d^{1/p+1}$, and then by Lemma 3, the embedding dimension of $S$ is
$O(\bar{\kappa}_p^p(AR^{-1})\, d^{|p/2-1|}\, d \log(1/\epsilon)/\epsilon^2) = O(d^{3+p/2}\log(1/\epsilon)/\epsilon^2)$.
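
For readers who want the exponent bookkeeping spelled out, the following worked calculation combines the bound on $\bar{\kappa}_p(AR^{-1})$ with the fact that $|p/2 - 1| = 1 - p/2$ for $p \in [1,2)$. It only restates the arithmetic behind the final equality and treats $6^p \le 36$ as a constant.

```latex
\begin{align*}
\bar{\kappa}_p^p(AR^{-1}) \cdot d^{|p/2-1|} \cdot d
  &\le \left(6\, d^{1/p+1}\right)^p \cdot d^{\,1-p/2} \cdot d \\
  &= 6^p \, d^{\,1+p} \cdot d^{\,1-p/2} \cdot d \\
  &= 6^p \, d^{\,(1+p) + (1-p/2) + 1} \\
  &= 6^p \, d^{\,3+p/2} = O\!\left(d^{\,3+p/2}\right).
\end{align*}
```

For example, $p = 1$ gives an embedding dimension of $O(d^{3.5}\log(1/\epsilon)/\epsilon^2)$.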